# Homework 1 (Due Monday March 23rd, 2020 at 11:59pm PST)

Every day late is -10%.

You are a business analyst working for a major US toy retailer:

* A manager in the marketing department wants to find out the most frequently used words in positive reviews (five stars) and negative reviews (one star) in order to determine what occasion the toys are purchased for (Christmas, birthdays, and anniversaries). He would like your opinion on **which gift occasions (Christmas, birthdays, or anniversaries) tend to have the most positive reviews** to focus marketing budget on those days.

* One of your product managers suspects that **toys purchased for male recipients (husbands, sons, etc.)** tend to be much more likely to be reviewed poorly. She would like to see some data points confirming or rejecting her hypothesis. 

* Use **regular expressions to parse out all references to recipients and gift occasions**, and account for the possibility that people may write words such as "son" / "children" / "Christmas" in singular or plural form, and in upper or lower case.

* Explain some of the pitfalls/limitations of using only a word count analysis to make these inferences. What additional research/steps would you need to do to verify your conclusions?

* Create a simple Excel CSV file that contains 2-3 lines at most describing yourself, your background, and interests. It must contain at least 1 emoji and 4-5 international characters (non-ASCII). Make sure to properly encode the file so that I can open it in `UTF-8` to read. Attach it to your submission.

Perform the same word count analysis using the reviews received from Amazon to answer your marketing manager's question. They are stored in two files, (`poor_amazon_toy_reviews.txt`) and (`good-amazon-toy-reviews.txt`). **Provide a few sentences with your findings and business recommendations.** Make any assumptions you'd like; this is a fictitious company after all. I just want you to get into the habit of "finishing" your analysis: to avoid delivering technical numbers to a non-technical manager.

**Submit everything as a new notebook and send it to me (Yu Chen) as an attachment in a Slack direct message.**

`NOTE`: Name the notebook `lastname_firstname_HW1.ipynb`.

## Part 0: Importing and Preparing data


In [1]:
import pandas as pd
from collections import Counter
import re

In [2]:
# Read both files line by line; the context manager closes each file afterwards
with open("poor_amazon_toy_reviews.txt", "r", encoding="UTF-8") as poor_review:
    poor_review_lines = poor_review.readlines()
with open("good_amazon_toy_reviews.txt", "r", encoding="UTF-8") as good_review:
    good_review_lines = good_review.readlines()


#### First, I created 2 dataframes (one for each review file) that contain the count of each word that appears in the reviews. To make the analysis easier down the road, I converted all of the words to lowercase.

In [3]:
def count_words(lines, delimiter=" "):
    """Count lowercase tokens obtained by splitting each line on `delimiter`."""
    words = Counter()
    for line in lines:
        for word in line.split(delimiter):
            words[word.lower()] += 1  # lowercase so "Son" and "son" are counted together
    return words

In [4]:
#Poor Review Word Count
poor_word_counts = count_words(poor_review_lines)  # count once, reuse for both columns
bad_review_df = pd.DataFrame(list(poor_word_counts.items()), columns=['word', 'frequency'])
bad_review_df.sort_values(by='frequency', ascending=False, inplace=True)
bad_review_df.head()

Unnamed: 0,word,frequency
15,the,21935
14,and,11190
121,it,10486
8,i,10231
63,to,9688


In [5]:
#Good Review word count
good_word_counts = count_words(good_review_lines)  # count once, reuse for both columns
good_review_df = pd.DataFrame(list(good_word_counts.items()), columns=['word', 'frequency'])
good_review_df.sort_values(by='frequency', ascending=False, inplace=True)
good_review_df.head()

Unnamed: 0,word,frequency
15,the,111394
28,and,88165
50,a,68776
14,to,63788
42,it,51931


#### Next, I wanted to see how many reviews exist in each file. By reading the files, I noticed that each review ends with \n. Therefore, I counted the occurrences of \n and took that as the total number of reviews. To gain further confidence in this assumption, I manually opened both text files outside of Python and confirmed that the number of reviews matches the numbers shown below.

In [6]:
# Read each file into a single string; the context manager closes each file afterwards
with open("poor_amazon_toy_reviews.txt", "r", encoding="UTF-8") as poor_review:
    poor_review_read = poor_review.read()
with open("good_amazon_toy_reviews.txt", "r", encoding="UTF-8") as good_review:
    good_review_read = good_review.read()

In [7]:
poor_review_read



In [8]:
good_review_count = len(re.findall(r'\n',good_review_read, flags = re.IGNORECASE))
poor_review_count = len(re.findall(r'\n',poor_review_read, flags = re.IGNORECASE))

print('Good Reviews Count: ', good_review_count)
print('Poor Reviews Count: ', poor_review_count)

Good Reviews Count:  102217
Poor Reviews Count:  12700
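
Counting `\n` assumes every review, including the last one, ends with a newline. The sketch below shows an alternative that counts non-empty lines instead. The helper name `count_reviews` and the sample text are made up for illustration; they are not part of the original notebook.

```python
def count_reviews(text):
    """Count non-empty lines, assuming one review per line."""
    return sum(1 for line in text.split("\n") if line.strip())

# Made-up sample: three reviews, with no newline after the last one.
sample = "Great toy!\nBroke after a day.\nMy son loves it"

newline_count = sample.count("\n")   # misses the last review
review_count = count_reviews(sample)  # counts all three
print(newline_count, review_count)
```

If the two numbers agree on the real files, the `\n` count used above is safe.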


## Part 1: Gift Occasion

#### As my first analysis of the gift occasions (Christmas, Birthday, Anniversary), I counted the mentions of each occasion. Note that I used '|' to cover as many spellings of each word as possible.

In [9]:
poor_christmas = len(re.findall(r'\b(christmas|xmas|x-mas)\b',poor_review_read, flags = re.IGNORECASE))
poor_birthday = len(re.findall(r'\b(birthday|bday|b-day)\b',poor_review_read, flags = re.IGNORECASE))
poor_anniversary = len(re.findall(r'\b(anniv|anniversary|anniversaries)\b',poor_review_read, flags = re.IGNORECASE))

print('Poor Christmas Gifts: ',poor_christmas)
print('Poor Birthday Gifts: ',poor_birthday)
print('Poor Anniversary Gifts: ',poor_anniversary)

Poor Christmas Gifts:  76
Poor Birthday Gifts:  470
Poor Anniversary Gifts:  4
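
The patterns above list each spelling explicitly and do not match plurals such as "birthdays". As a hedged sketch, the same idea can be written with optional suffixes and precompiled patterns. The table `OCCASION_PATTERNS`, the helper `count_occasions`, and the sample sentence are hypothetical, and the alternations are illustrative rather than exhaustive.

```python
import re

# Hypothetical pattern table: optional plural endings and an optional hyphen.
OCCASION_PATTERNS = {
    "christmas": re.compile(r"\b(?:christmas(?:es)?|x-?mas)\b", re.IGNORECASE),
    "birthday": re.compile(r"\b(?:birthdays?|b-?days?)\b", re.IGNORECASE),
    "anniversary": re.compile(r"\b(?:anniversar(?:y|ies)|anniv)\b", re.IGNORECASE),
}

def count_occasions(text):
    """Return the number of mentions of each occasion in `text`."""
    return {name: len(p.findall(text)) for name, p in OCCASION_PATTERNS.items()}

sample = "Bought for XMAS and two Birthdays. Our anniversaries are special; b-day too."
occasion_counts = count_occasions(sample)
print(occasion_counts)
```

Using a non-capturing group `(?:...)` keeps `findall` returning whole matches, and one table avoids repeating each pattern for the good and the poor file.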


In [10]:
print('Poor Christmas Gifts: ',round(poor_christmas/poor_review_count*100,2),'%')
print('Poor Birthday Gifts: ', round(poor_birthday/poor_review_count*100,2),'%')
print('Poor Anniversary Gifts: ',round(poor_anniversary/poor_review_count*100,2),'%')

Poor Christmas Gifts:  0.6 %
Poor Birthday Gifts:  3.7 %
Poor Anniversary Gifts:  0.03 %


In [11]:
good_christmas = len(re.findall(r'\b(christmas|xmas|x-mas)\b',good_review_read, flags = re.IGNORECASE))
good_birthday = len(re.findall(r'\b(birthday|bday|b-day)\b',good_review_read, flags = re.IGNORECASE))
good_anniversary = len(re.findall(r'\b(anniv|anniversary|anniversaries)\b',good_review_read, flags = re.IGNORECASE))

print('Good Christmas Gifts: ',good_christmas)
print('Good Birthday Gifts: ',good_birthday)
print('Good Anniversary Gifts: ',good_anniversary)

Good Christmas Gifts:  1287
Good Birthday Gifts:  4158
Good Anniversary Gifts:  53


In [12]:
print('Good Christmas Gifts: ',round(good_christmas/good_review_count*100,2),'%')
print('Good Birthday Gifts: ',round(good_birthday/good_review_count*100,2),'%')
print('Good Anniversary Gifts: ',round(good_anniversary/good_review_count*100,2),'%')

Good Christmas Gifts:  1.26 %
Good Birthday Gifts:  4.07 %
Good Anniversary Gifts:  0.05 %


BIRTHDAY is mentioned more often than CHRISTMAS and ANNIVERSARY in both good and poor reviews, which suggests that more toys were purchased for birthdays. This can be seen by comparing the share of the good/poor reviews that mention each occasion.

In [13]:
print('Good Christmas Gift Ratio: ', round(good_christmas / (poor_christmas + good_christmas)*100,2),'%')

print('Good Birthday Gift Ratio: ', round(good_birthday / (poor_birthday + good_birthday)*100,2),'%')

print('Good Anniversary Gift Ratio: ', round(good_anniversary / (poor_anniversary + good_anniversary) *100,2),'%')

Good Christmas Gift Ratio:  94.42 %
Good Birthday Gift Ratio:  89.84 %
Good Anniversary Gift Ratio:  92.98 %


However, when comparing the share of mentions that come from good reviews, CHRISTMAS toys tended to have better reviews. In other words, consumers were more satisfied with purchases they made for CHRISTMAS than for the other occasions.

For further details, refer to Part 4 below.
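
One caveat when reading the "good ratio": the good-review file is about eight times larger than the poor-review file, so any word will appear mostly in good reviews. A minimal sketch of a fairer comparison is to measure each occasion against the overall share of good reviews (roughly 89%). The counts are copied from the outputs above; the variable names are made up for this example.

```python
# Counts copied from the outputs above.
good_total, poor_total = 102217, 12700
occasions = {
    "Christmas": (1287, 76),
    "Birthday": (4158, 470),
    "Anniversary": (53, 4),
}

# Share of good reviews across both files, used as the reference point.
baseline = good_total / (good_total + poor_total) * 100
print(f"Baseline share of good reviews: {baseline:.2f} %")

shares = {}
for name, (good, poor) in occasions.items():
    shares[name] = good / (good + poor) * 100
    diff = shares[name] - baseline
    print(f"{name}: {shares[name]:.2f} % ({diff:+.2f} points vs. baseline)")
```

Read this way, CHRISTMAS sits clearly above the baseline, while BIRTHDAY is close to it.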

## Part 2: Gift Recipients by Gender

In [14]:
boy_good_review = len(re.findall(r'\b(boy|boys)\b',good_review_read, flags = re.IGNORECASE))
son_good_review = len(re.findall(r'\b(son|sons)\b',good_review_read, flags = re.IGNORECASE))
husband_good_review = len(re.findall(r'\b(husband)\b',good_review_read, flags = re.IGNORECASE))
brother_good_review = len(re.findall(r'\b(brother|brothers)\b',good_review_read, flags = re.IGNORECASE))
father_good_review = len(re.findall(r'\b(father|dad|daddy)\b',good_review_read, flags = re.IGNORECASE))
boyfriend_good_review = len(re.findall(r'\b(boyfriend)\b',good_review_read, flags = re.IGNORECASE))
gradnson_good_review = len(re.findall(r'\b(grandson|grandsons)\b',good_review_read, flags = re.IGNORECASE))
nephew_good_review = len(re.findall(r'\b(nephew|nephews)\b',good_review_read, flags = re.IGNORECASE))
uncle_good_review = len(re.findall(r'\b(uncle|uncles)\b',good_review_read, flags = re.IGNORECASE))
grandfather_good_review = len(re.findall(r'\b(grandfather|grandpa|granddad)\b',good_review_read, flags = re.IGNORECASE))
man_good_review = len(re.findall(r'\b(man|men)\b',good_review_read, flags = re.IGNORECASE))
male_good_review = boy_good_review + son_good_review + husband_good_review + brother_good_review + father_good_review + boyfriend_good_review + gradnson_good_review + nephew_good_review + uncle_good_review + grandfather_good_review + man_good_review


print('boy (good review): ',boy_good_review)
print('son (good review): ',son_good_review)
print('husband (good review): ',husband_good_review)
print('brother (good review): ',brother_good_review)
print('father (good review): ',father_good_review)
print('boyfriend (good review): ',boyfriend_good_review)
print('grandson (good review): ',gradnson_good_review)
print('nephew (good review): ',nephew_good_review)
print('uncle (good review): ',uncle_good_review)
print('grandfather (good review): ',grandfather_good_review)
print('man (good review): ',man_good_review)
print('Total MALE Count for GOOD Reviews: ', male_good_review)

boy (good review):  1775
son (good review):  7470
husband (good review):  784
brother (good review):  549
father (good review):  558
boyfriend (good review):  183
grandson (good review):  4382
nephew (good review):  1493
uncle (good review):  38
grandfather (good review):  125
man (good review):  443
Total MALE Count for GOOD Reviews:  17800


In [15]:
girl_good_review = len(re.findall(r'\b(girl|girls)\b',good_review_read, flags = re.IGNORECASE))
daughter_good_review = len(re.findall(r'\b(daughter|daughters)\b',good_review_read, flags = re.IGNORECASE))
wife_good_review = len(re.findall(r'\b(wife)\b',good_review_read, flags = re.IGNORECASE))
sister_good_review = len(re.findall(r'\b(sister|sisters)\b',good_review_read, flags = re.IGNORECASE))
mother_good_review = len(re.findall(r'\b(mother|mom|mommy)\b',good_review_read, flags = re.IGNORECASE))
girlfriend_good_review = len(re.findall(r'\b(girlfriend)\b',good_review_read, flags = re.IGNORECASE))
gradndaughter_good_review = len(re.findall(r'\b(granddaughter|granddaughters)\b',good_review_read, flags = re.IGNORECASE))
niece_good_review = len(re.findall(r'\b(niece|nieces)\b',good_review_read, flags = re.IGNORECASE))
aunt_good_review = len(re.findall(r'\b(aunt|aunts)\b',good_review_read, flags = re.IGNORECASE))
grandmother_good_review = len(re.findall(r'\b(grandmother|grandma|granny)\b',good_review_read, flags = re.IGNORECASE))
woman_good_review = len(re.findall(r'\b(woman|women)\b',good_review_read, flags = re.IGNORECASE))
female_good_review = girl_good_review+daughter_good_review+wife_good_review+sister_good_review+mother_good_review+girlfriend_good_review+gradndaughter_good_review+niece_good_review+aunt_good_review+grandmother_good_review+woman_good_review

print('girl (good review): ',girl_good_review)
print('daughter (good review): ',daughter_good_review)
print('wife (good review): ',wife_good_review)
print('sister (good review): ',sister_good_review)
print('mother (good review): ',mother_good_review)
print('girlfriend (good review): ',girlfriend_good_review)
print('granddaughter (good review): ',gradndaughter_good_review)
print('niece (good review): ',niece_good_review)
print('aunt (good review): ',aunt_good_review)
print('grandmother (good review): ',grandmother_good_review)
print('woman (good review): ',woman_good_review)
print('Total FEMALE Count for GOOD Reviews: ', female_good_review)

girl (good review):  2290
daughter (good review):  7625
wife (good review):  397
sister (good review):  563
mother (good review):  813
girlfriend (good review):  145
granddaughter (good review):  3052
niece (good review):  1257
aunt (good review):  34
grandmother (good review):  202
woman (good review):  95
Total FEMALE Count for GOOD Reviews:  16473


In [16]:
boy_poor_review = len(re.findall(r'\b(boy|boys)\b',poor_review_read, flags = re.IGNORECASE))
son_poor_review = len(re.findall(r'\b(son|sons)\b',poor_review_read, flags = re.IGNORECASE))
husband_poor_review = len(re.findall(r'\b(husband)\b',poor_review_read, flags = re.IGNORECASE))
brother_poor_review = len(re.findall(r'\b(brother|brothers)\b',poor_review_read, flags = re.IGNORECASE))
father_poor_review = len(re.findall(r'\b(father|dad|daddy)\b',poor_review_read, flags = re.IGNORECASE))
boyfriend_poor_review = len(re.findall(r'\b(boyfriend)\b',poor_review_read, flags = re.IGNORECASE))
gradnson_poor_review = len(re.findall(r'\b(grandson|grandsons)\b',poor_review_read, flags = re.IGNORECASE))
nephew_poor_review = len(re.findall(r'\b(nephew|nephews)\b',poor_review_read, flags = re.IGNORECASE))
uncle_poor_review = len(re.findall(r'\b(uncle|uncles)\b',poor_review_read, flags = re.IGNORECASE))
grandfather_poor_review = len(re.findall(r'\b(grandfather|grandpa|granddad)\b',poor_review_read, flags = re.IGNORECASE))
man_poor_review = len(re.findall(r'\b(man|men)\b',poor_review_read, flags = re.IGNORECASE))
male_poor_review = boy_poor_review + son_poor_review + husband_poor_review + brother_poor_review + father_poor_review + boyfriend_poor_review + gradnson_poor_review + nephew_poor_review + uncle_poor_review + grandfather_poor_review + man_poor_review

print('boy (poor review): ',boy_poor_review)
print('son (poor review): ',son_poor_review)
print('husband (poor review): ',husband_poor_review)
print('brother (poor review): ',brother_poor_review)
print('father (poor review): ',father_poor_review)
print('boyfriend (poor review): ',boyfriend_poor_review)
print('grandson (poor review): ',gradnson_poor_review)
print('nephew (poor review): ',nephew_poor_review)
print('uncle (poor review): ',uncle_poor_review)
print('grandfather (poor review): ',grandfather_poor_review)
print('man (poor review): ',man_poor_review)
print('Total MALE Count for POOR Reviews: ', male_poor_review)

boy (poor review):  95
son (poor review):  701
husband (poor review):  63
brother (poor review):  24
father (poor review):  31
boyfriend (poor review):  7
grandson (poor review):  195
nephew (poor review):  51
uncle (poor review):  7
grandfather (poor review):  0
man (poor review):  39
Total MALE Count for POOR Reviews:  1213


In [17]:
girl_poor_review = len(re.findall(r'\b(girl|girls)\b',poor_review_read, flags = re.IGNORECASE))
daughter_poor_review = len(re.findall(r'\b(daughter|daughters)\b',poor_review_read, flags = re.IGNORECASE))
wife_poor_review = len(re.findall(r'\b(wife)\b',poor_review_read, flags = re.IGNORECASE))
sister_poor_review = len(re.findall(r'\b(sister|sisters)\b',poor_review_read, flags = re.IGNORECASE))
mother_poor_review = len(re.findall(r'\b(mother|mom|mommy)\b',poor_review_read, flags = re.IGNORECASE))
girlfriend_poor_review = len(re.findall(r'\b(girlfriend)\b',poor_review_read, flags = re.IGNORECASE))
gradndaughter_poor_review = len(re.findall(r'\b(granddaughter|granddaughters)\b',poor_review_read, flags = re.IGNORECASE))
niece_poor_review = len(re.findall(r'\b(niece|nieces)\b',poor_review_read, flags = re.IGNORECASE))
aunt_poor_review = len(re.findall(r'\b(aunt|aunts)\b',poor_review_read, flags = re.IGNORECASE))
grandmother_poor_review = len(re.findall(r'\b(grandmother|grandma|granny)\b',poor_review_read, flags = re.IGNORECASE))
woman_poor_review = len(re.findall(r'\b(woman|women)\b',poor_review_read, flags = re.IGNORECASE))
female_poor_review = girl_poor_review+daughter_poor_review+wife_poor_review+sister_poor_review+mother_poor_review+girlfriend_poor_review+gradndaughter_poor_review+niece_poor_review+aunt_poor_review+grandmother_poor_review+woman_poor_review


print('girl (poor review): ',girl_poor_review)
print('daughter (poor review): ',daughter_poor_review)
print('wife (poor review): ',wife_poor_review)
print('sister (poor review): ',sister_poor_review)
print('mother (poor review): ',mother_poor_review)
print('girlfriend (poor review): ',girlfriend_poor_review)
print('granddaughter (poor review): ',gradndaughter_poor_review)
print('niece (poor review): ',niece_poor_review)
print('aunt (poor review): ',aunt_poor_review)
print('grandmother (poor review): ',grandmother_poor_review)
print('woman (poor review): ',woman_poor_review)
print('Total FEMALE Count for POOR Reviews: ',female_poor_review)

girl (poor review):  107
daughter (poor review):  465
wife (poor review):  24
sister (poor review):  26
mother (poor review):  45
girlfriend (poor review):  1
granddaughter (poor review):  95
niece (poor review):  48
aunt (poor review):  0
grandmother (poor review):  8
woman (poor review):  15
Total FEMALE Count for POOR Reviews:  834


In [18]:
print('Total MALE Count for GOOD Reviews: ', male_good_review)
print('Total FEMALE Count for GOOD Reviews: ', female_good_review)
print('Total MALE Count for POOR Reviews: ', male_poor_review)
print('Total FEMALE Count for POOR Reviews: ',female_poor_review)

Total MALE Count for GOOD Reviews:  17800
Total FEMALE Count for GOOD Reviews:  16473
Total MALE Count for POOR Reviews:  1213
Total FEMALE Count for POOR Reviews:  834


In [19]:
print('% of MALE Recipients for GOOD Reviews: ', round(male_good_review/good_review_count*100,2),'%')
print('% of FEMALE Recipients for GOOD Reviews: ', round(female_good_review/good_review_count*100,2),'%')
print('% of MALE Recipients for POOR Reviews: ', round(male_poor_review/poor_review_count*100,2),'%')
print('% of FEMALE Recipients for POOR Reviews: ',round(female_poor_review/poor_review_count*100,2),'%')

% of MALE Recipients for GOOD Reviews:  17.41 %
% of FEMALE Recipients for GOOD Reviews:  16.12 %
% of MALE Recipients for POOR Reviews:  9.55 %
% of FEMALE Recipients for POOR Reviews:  6.57 %


In [20]:
print('Of all the GOOD reviews that included information of recipients gender:')
print('MALE Percentage: ', round(male_good_review/(male_good_review+female_good_review)*100,2),'%')
print('FEMALE Percentage: ',round(female_good_review/(male_good_review+female_good_review)*100,2),'%')


print('Of all the POOR reviews that included information of recipients gender:')
print('MALE Percentage: ', round(male_poor_review/(male_poor_review+female_poor_review)*100,2),'%')
print('FEMALE Percentage: ',round(female_poor_review/(male_poor_review+female_poor_review)*100,2),'%')

Of all the GOOD reviews that included information of recipients gender:
MALE Percentage:  51.94 %
FEMALE Percentage:  48.06 %
Of all the POOR reviews that included information of recipients gender:
MALE Percentage:  59.26 %
FEMALE Percentage:  40.74 %


More poor-review mentions are associated with male recipients than with female recipients (59.26% vs. 40.74%). In contrast, the split between the two genders was fairly even for positive reviews (51.94% vs. 48.06%).

For further details, refer to Part 4 below.
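
The cells in this part repeat the same `re.findall` call for every term and for both files. A possible refactoring, shown here only as a sketch, drives the counting from a list of terms. The shortened term lists, the helper `count_terms`, and the sample text are hypothetical; irregular plurals such as "wives" or "men" would still need their own entries.

```python
import re

# Hypothetical, shortened term lists; extend or trim as needed.
MALE_TERMS = ["boy", "son", "husband", "brother", "nephew", "grandson", "uncle"]
FEMALE_TERMS = ["girl", "daughter", "wife", "sister", "niece", "granddaughter", "aunt"]

def count_terms(text, terms):
    """Return {term: count}, matching each term with an optional plural 's'."""
    counts = {}
    for term in terms:
        pattern = re.compile(rf"\b{re.escape(term)}s?\b", re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts

sample = "My Son and his sons loved it. My daughter's sister did not."
male_counts = count_terms(sample, MALE_TERMS)
female_counts = count_terms(sample, FEMALE_TERMS)
print(sum(male_counts.values()), sum(female_counts.values()))
```

Calling `count_terms(good_review_read, MALE_TERMS)` and the three sibling combinations would replace the four long cells, and the per-term dictionary can still be printed line by line.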

#### I also tried a different method to identify the recipients by gender. First, I created 2 lists, one for each gender, that include possible ways to refer to the gift recipients.

An advantage of this method is that it is easier to add or remove a "title" later on. The user can simply update the list and rerun the code.

In [21]:
male = ['boy', 'boys', 'son', 'sons', 'husband', 'brother', 'brothers', 'father', 'dad', 'daddy', 'boyfriend', 'grandson', 'gradnsons', 'nephew', 'nephews', 'uncle', 'uncles', 'grandfather', 'grandpa', 'granddad', 'man', 'men']
female = ['girl', 'girls', 'daughter', 'daughters', 'wife', 'sister', 'sisters', 'mother', 'mom', 'mommy', 'girlfriend', 'granddaughter', 'granddaughters' 'niece', 'nieces', 'aunt', 'aunts', 'grandmother', 'grandma', 'granny', 'woman', 'women']

Then, I created 2 dataframes that hold the count of each word in the list for good_review and bad_review.

Note that I don't have to worry about capital letters because all the words in the dataframes are already in lowercase.

In [22]:
male_df1 = pd.DataFrame(columns = ['he/him/his','good_review_count'])
for i in male:
    try:
        male_df1 = male_df1.append(pd.DataFrame({'he/him/his': i, 
                                               'good_review_count': int(good_review_df.loc[good_review_df.word == i,'frequency'])}, 
                                              index=[0]), ignore_index=True)
    except:
        male_df1 = male_df1.append(pd.DataFrame({'he/him/his': i,
                                               'good_review_count': 0},index=[0]), ignore_index=True)
male_df1.set_index('he/him/his', inplace=True)
male_df1

Unnamed: 0_level_0,good_review_count
he/him/his,Unnamed: 1_level_1
boy,563
boys,748
son,5869
sons,355
husband,671
brother,319
brothers,70
father,63
dad,234
daddy,69


In [23]:
male_df2 = pd.DataFrame(columns = ['he/him/his','bad_review_count'])
for i in male:
    try:
        male_df2 = male_df2.append(pd.DataFrame({'he/him/his': i, 
                                               'bad_review_count': int(bad_review_df.loc[bad_review_df.word == i,'frequency'])}, 
                                              index=[0]), ignore_index=True)
    except:
        male_df2 = male_df2.append(pd.DataFrame({'he/him/his': i,
                                               'bad_review_count': 0},index=[0]), ignore_index=True)
male_df2.set_index('he/him/his', inplace=True)
male_df2

Unnamed: 0_level_0,bad_review_count
he/him/his,Unnamed: 1_level_1
boy,38
boys,30
son,511
sons,48
husband,56
brother,17
brothers,3
father,7
dad,12
daddy,1


Next, I simply combined the two dataframes.

In [24]:
male_df = pd.concat([male_df1, male_df2], axis=1, join='inner')
male_df.reset_index(inplace = True)
male_df.sort_values(by='good_review_count', ascending=False, inplace=True)
male_df

Unnamed: 0,he/him/his,good_review_count,bad_review_count
2,son,5869,511
11,grandson,3178,121
13,nephew,990,29
1,boys,748,30
4,husband,671,56
0,boy,563,38
3,sons,355,48
5,brother,319,17
8,dad,234,12
20,man,188,17


In [25]:
print ('Total number of Male recipients in Good Review: ',male_df['good_review_count'].sum())
print ('Total number of Male recipients in Bad Review: ',male_df['bad_review_count'].sum())

Total number of Male recipients in Good Review:  13754
Total number of Male recipients in Bad Review:  906
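
The lookup cells above rely on `DataFrame.append`, which was removed in pandas 2.0, so they need an older pandas version to run. A minimal alternative sketch looks the words up directly in the `Counter` objects, which return 0 for words they never saw and therefore need no `try`/`except`. The helper `lookup_counts` and the small counters are made up for illustration.

```python
from collections import Counter

def lookup_counts(terms, good_counts, bad_counts):
    """Build one row per term; a Counter returns 0 for unseen words."""
    return [
        {
            "word": term,
            "good_review_count": good_counts[term],
            "bad_review_count": bad_counts[term],
        }
        for term in terms
    ]

# Made-up counters standing in for the output of count_words on each file.
good = Counter({"son": 5, "boy": 2})
bad = Counter({"son": 1})

rows = lookup_counts(["son", "boy", "uncle"], good, bad)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` would give a table with the same three columns as `male_df`.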


#### I repeated the steps above for the female recipients. 

In [26]:
female_df1 = pd.DataFrame(columns = ['she/her/her','good_review_count'])
for i in female:
    try:
        female_df1 = female_df1.append(pd.DataFrame({'she/her/her': i, 
                                               'good_review_count': int(good_review_df.loc[good_review_df.word == i,'frequency'])}, 
                                              index=[0]), ignore_index=True)
    except:
        female_df1 = female_df1.append(pd.DataFrame({'she/her/her': i,
                                               'good_review_count': 0},index=[0]), ignore_index=True)
female_df1.set_index('she/her/her', inplace=True)

In [27]:
female_df2 = pd.DataFrame(columns = ['she/her/her','bad_review_count'])
for i in female:
    try:
        female_df2 = female_df2.append(pd.DataFrame({'she/her/her': i, 
                                               'bad_review_count': int(bad_review_df.loc[bad_review_df.word == i,'frequency'])}, 
                                              index=[0]), ignore_index=True)
    except:
        female_df2 = female_df2.append(pd.DataFrame({'she/her/her': i,
                                               'bad_review_count': 0},index=[0]), ignore_index=True)
female_df2.set_index('she/her/her', inplace=True)

In [28]:
female_df = pd.concat([female_df1, female_df2], axis=1, join='inner')
female_df.reset_index(inplace = True)
female_df.sort_values(by='good_review_count', ascending=False, inplace=True)
female_df

Unnamed: 0,she/her/her,good_review_count,bad_review_count
2,daughter,5821,332
11,granddaughter,2184,61
0,girl,1006,49
1,girls,759,27
3,daughters,560,39
5,sister,354,14
8,mom,351,15
4,wife,323,23
7,mother,166,19
13,nieces,130,6


In [29]:
print ('Total number of Female recipients in Good Review: ',female_df['good_review_count'].sum())
print ('Total number of Female recipients in Bad Review: ',female_df['bad_review_count'].sum())

Total number of Female recipients in Good Review:  12072
Total number of Female recipients in Bad Review:  604


In [30]:
gender_good_review = pd.DataFrame(columns = ['gender', 'good_review_count'])
gender_good_review = gender_good_review.append(pd.DataFrame({'gender': 'Male', 
                               'good_review_count': male_df['good_review_count'].sum()},
                              index=[0]), ignore_index=True)
gender_good_review = gender_good_review.append(pd.DataFrame({'gender': 'Female', 
                               'good_review_count': female_df['good_review_count'].sum()},
                              index=[0]), ignore_index=True)
gender_good_review

Unnamed: 0,gender,good_review_count
0,Male,13754
1,Female,12072


In [31]:
gender_bad_review = pd.DataFrame(columns = ['gender', 'bad_review_count'])
gender_bad_review = gender_bad_review.append(pd.DataFrame({'gender': 'Male', 
                               'bad_review_count': male_df['bad_review_count'].sum()},
                              index=[0]), ignore_index=True)
gender_bad_review = gender_bad_review.append(pd.DataFrame({'gender': 'Female', 
                               'bad_review_count': female_df['bad_review_count'].sum()},
                              index=[0]), ignore_index=True)
gender_bad_review

Unnamed: 0,gender,bad_review_count
0,Male,906
1,Female,604


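NOTE: The numbers do not match between the first method using re and the second method that looks the words up in the dataframe. Two causes are visible in the code. First, count_words splits only on spaces, so tokens that keep their punctuation or a trailing newline (for example "son." or "son,") are not counted as "son", while the \b word boundary in the regular expressions does match them. Second, the lists in In [21] contain a typo ('gradnsons') and a missing comma between 'granddaughters' and 'niece', so 'grandsons', 'granddaughters', and 'niece' are never looked up. Still, I am including this part of the analysis to show a potential alternative method.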
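
The gap between the two methods can be reproduced on a small made-up sample. The sketch below splits on spaces the same way `count_words` does and compares the result with the `\b` regex used in Part 2; the sample text is invented for the example.

```python
import re
from collections import Counter

sample = "Bought this for my son.\nMy son, age 5, loves it! Son approved\n"

# Approach 1: split each line on single spaces, as count_words does.
tokens = Counter()
for line in sample.splitlines(keepends=True):
    for word in line.split(" "):
        tokens[word.lower()] += 1

# Approach 2: regex with word boundaries, as in Part 2.
regex_hits = len(re.findall(r"\bson\b", sample, flags=re.IGNORECASE))

print(tokens["son"])   # only the bare token "son" is counted
print(regex_hits)      # punctuation does not block a \b match
print([t for t in tokens if "son" in t])
```

Tokens such as `'son.\n'` and `'son,'` are stored as separate words, which is why the dataframe totals come out lower than the regex totals.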

## Part 3: Limitation of the word count analysis

#### As much as the analysis can add value to future marketing strategy, I also noticed some limitations while performing the analysis above.

First, the analysis above does not account for repetition. For example, a reviewer may well use different words for the same person, such as son, he, and him, in the same review. One possible method is to go through the reviews one by one and check whether each contains any word that tells who the gift was for (a for loop over the reviews and another for loop over the male/female lists), then count each review only once. Unfortunately, this does not seem very efficient for now, and it becomes ambiguous when a review mentions both genders.
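
A per-review version of the count can be sketched as follows. Each review is labeled once, no matter how many times a recipient word appears, and reviews that mention both genders get their own label. The shortened patterns, the function `classify_review`, and the sample reviews are hypothetical.

```python
import re

# Hypothetical, shortened patterns; a real run would use the full term lists.
MALE = re.compile(r"\b(?:sons?|boys?|husbands?|grandsons?|nephews?)\b", re.IGNORECASE)
FEMALE = re.compile(
    r"\b(?:daughters?|girls?|wi(?:fe|ves)|granddaughters?|nieces?)\b", re.IGNORECASE
)

def classify_review(review):
    """Label one review as 'male', 'female', 'both', or 'none'."""
    has_male = MALE.search(review) is not None
    has_female = FEMALE.search(review) is not None
    if has_male and has_female:
        return "both"
    if has_male:
        return "male"
    if has_female:
        return "female"
    return "none"

reviews = [
    "My son loves it. Son says it is the best. Sons agree.",
    "Gift for my daughter and my nephew.",
    "Arrived broken.",
]
labels = [classify_review(r) for r in reviews]
print(labels)
```

Because `search` stops at the first match, this needs only one pass per review rather than a nested loop over every word in the lists.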

Second, we may not know which occasion the gift was for unless it was specifically written in the review. For a more accurate analysis in the future, it may be helpful to know the dates when the toys were ordered in order to infer the occasion of the gift. User information (birthday or past orders) would also be helpful.

Third, if the purchase was not intended as a gift but was something the reviewer purchased for him/herself, the count analysis is limited because it is difficult to determine whether the reviewer is male or female. Therefore, the user's gender information should be taken into consideration.

Lastly, the words we used to identify specific information, whether the occasion or the gender, may not cover all possible values. There might be spelling errors on the reviewers' side, or we may have failed to include all possible words.

## Part 4: Conclusions

#### Which gift occasions (Christmas, birthdays, or anniversaries) tend to have the most positive reviews?

In [32]:
print('Good Christmas Gifts: ',round(good_christmas/good_review_count*100,2),'%')
print('Good Birthday Gifts: ',round(good_birthday/good_review_count*100,2),'%')
print('Good Anniversary Gifts: ',round(good_anniversary/good_review_count*100,2),'%')

Good Christmas Gifts:  1.26 %
Good Birthday Gifts:  4.07 %
Good Anniversary Gifts:  0.05 %


In [33]:
print('Poor Christmas Gifts: ',round(poor_christmas/poor_review_count*100,2),'%')
print('Poor Birthday Gifts: ', round(poor_birthday/poor_review_count*100,2),'%')
print('Poor Anniversary Gifts: ',round(poor_anniversary/poor_review_count*100,2),'%')

Poor Christmas Gifts:  0.6 %
Poor Birthday Gifts:  3.7 %
Poor Anniversary Gifts:  0.03 %


This result shows that BIRTHDAY gifts are mentioned more often in both good and poor reviews than CHRISTMAS or ANNIVERSARY gifts. This simply indicates that consumers bought toys for BIRTHDAYS more than for the other 2 occasions.

In [34]:
print('Good Christmas Gift Ratio: ', round(good_christmas / (poor_christmas + good_christmas)*100,2),'%')

print('Good Birthday Gift Ratio: ', round(good_birthday / (poor_birthday + good_birthday)*100,2),'%')

print('Good Anniversary Gift Ratio: ', round(good_anniversary / (poor_anniversary + good_anniversary) *100,2),'%')

Good Christmas Gift Ratio:  94.42 %
Good Birthday Gift Ratio:  89.84 %
Good Anniversary Gift Ratio:  92.98 %


When looking at which of these occasions has the highest tendency to get positive reviews, I noticed that users were most likely to be satisfied with purchases made for CHRISTMAS.

Therefore, Amazon should promote gifts for CHRISTMAS, as it has been shown to provide the most positive experiences for users. CHRISTMAS was chosen over ANNIVERSARY for the following reasons:

1) CHRISTMAS is much easier to anticipate and monitor than users' ANNIVERSARIES. Amazon would need to either collect additional data or predict the anniversary date for each user, while the date of Christmas is fixed.

2) Due to the nature of ANNIVERSARIES, the associated gifts are more likely to be personalized and may not be purchased as much through Amazon. This can be seen in the low total count for this occasion.

3) CHRISTMAS shows a better review rate, 94.42%, as shown above.


At the same time, Amazon should work on improving the quality of BIRTHDAY gifts in order to take advantage of the high demand for that occasion.

#### Is it true that the toys purchased for male recipients (husbands, sons, etc.) tend to be much more likely to be reviewed poorly?

In [35]:
print('Total MALE Count for GOOD Reviews: ', male_good_review)
print('Total FEMALE Count for GOOD Reviews: ', female_good_review)
print('Total MALE Count for POOR Reviews: ', male_poor_review)
print('Total FEMALE Count for POOR Reviews: ',female_poor_review)

Total MALE Count for GOOD Reviews:  17800
Total FEMALE Count for GOOD Reviews:  16473
Total MALE Count for POOR Reviews:  1213
Total FEMALE Count for POOR Reviews:  834


In [36]:
print('Of all the GOOD reviews that included information of recipients gender:')
print('MALE Percentage: ', round(male_good_review/(male_good_review+female_good_review)*100,2),'%')
print('FEMALE Percentage: ',round(female_good_review/(male_good_review+female_good_review)*100,2),'%')

Of all the GOOD reviews that included information of recipients gender:
MALE Percentage:  51.94 %
FEMALE Percentage:  48.06 %


In [37]:
print('Of all the POOR reviews that included information of recipients gender:')
print('MALE Percentage: ', round(male_poor_review/(male_poor_review+female_poor_review)*100,2),'%')
print('FEMALE Percentage: ',round(female_poor_review/(male_poor_review+female_poor_review)*100,2),'%')

Of all the POOR reviews that included information of recipients gender:
MALE Percentage:  59.26 %
FEMALE Percentage:  40.74 %


As seen above, among the poor reviews that had gender information, around 60% of the recipients were male and about 40% were female, whereas the split for good reviews was about half and half. Note that this is not the ratio across all reviews but only across the reviews with gender information.

Still, toys purchased for male recipients account for a higher percentage of poor reviews than toys purchased for female recipients. Although further statistical analysis would be helpful to confirm this finding, the initial hypothesis appears to hold.
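
As one example of the further statistical analysis mentioned above, a two-proportion z-test can compare the share of poor mentions for male and female recipients. This is only a sketch: the counts are word mentions rather than independent reviews, so the result overstates the certainty, and the raw shares also reflect the different sizes of the two files. The function `two_proportion_z` is written for this example and is not part of the original notebook.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Mention counts copied from the totals above.
male_poor, male_good = 1213, 17800
female_poor, female_good = 834, 16473

male_rate = male_poor / (male_poor + male_good)
female_rate = female_poor / (female_poor + female_good)
z = two_proportion_z(male_poor, male_poor + male_good,
                     female_poor, female_poor + female_good)

print(f"Poor share, male mentions:   {male_rate:.2%}")
print(f"Poor share, female mentions: {female_rate:.2%}")
print(f"z = {z:.2f}")
```

A z value well above 2 points in the same direction as the hypothesis, but per-review counts would be needed before treating it as confirmed.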

## Part 5: Describe myself

In [38]:
me = open("describe_myself.csv", "r", encoding = 'UTF-8')
me = me.read()
me

'\ufeff"My name is Sam and I am currently a first year in the MSBA program at USC. For my undergrad, I studied at Notre Dame. ☘️",,,,,,,,,\n"Even though I am from 대한민국, I have lived in other countries like China, Austria, and US.",,,,,,,,,\nI like to watch and play sports like tennis and basketball. ,,,,,,,,,\n,,,,,,,,,'
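
The leading `'\ufeff'` in the output above is the byte order mark (BOM) that Excel writes when saving a CSV as UTF-8. Reading the file with the `utf-8-sig` codec strips it. The sketch below demonstrates this on a temporary file; the file name and the two rows are stand-ins, not the actual `describe_myself.csv`.

```python
import csv
import os
import tempfile

rows = [["My name is Sam ☘️"], ["I am from 대한민국"]]

# Write a stand-in file with a BOM, the way Excel's "CSV UTF-8" export does.
path = os.path.join(tempfile.mkdtemp(), "describe_myself_demo.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Plain UTF-8 keeps the BOM as the first character.
with open(path, "r", encoding="utf-8") as f:
    raw = f.read()

# utf-8-sig removes the BOM; csv.reader then splits the fields.
with open(path, "r", encoding="utf-8-sig", newline="") as f:
    parsed = list(csv.reader(f))

print(repr(raw[:1]))
print(parsed)
```

Using `csv.reader` also removes the need to deal with the trailing empty columns and quoting by hand.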