# Homework 1 (Due Tuesday, March 30th, 2021 at 6:29pm PST)

Every day late is -10%.

You are a business analyst working for a major US toy retailer:

* A manager in the marketing department wants to find out the most frequently used words in positive reviews (five stars) and negative reviews (one star) in order to determine what occasion the toys are purchased for (Christmas, birthdays, and anniversaries). He would like your opinion on **which gift occasions (Christmas, birthdays, or anniversaries) tend to have the most positive reviews** to focus marketing budget on those days.

* There are malformed characters in the review text. For instance, notice the `&#34;` - these are examples of incorrectly decoded [HTML encodings](https://krypted.com/utilities/html-encoding-reference/).
```
"amazing quality first of all, these cards are amazing proxies (but don't try to use em in &#34;official duels&#34; unless a judge is okay with it, if you have the real thing to show) and look amazing in your binder!"
```
Please clean up all instances of these incorrect decodings.

* One of your product managers suspects that **toys purchased for male recipients (husbands, sons, etc.)** tend to be much more likely to be reviewed poorly. She would like to see some data points confirming or rejecting her hypothesis. 

* Use **regular expressions to parse out all references to recipients and gift occasions**, and account for the possibility that people may write words such as "son" / "children" / "Christmas" in singular or plural form and in upper or lower case.

* Explain some of the **pitfalls/limitations** of using only a word count analysis to make these inferences. What additional research/steps would you need to do to verify your conclusions?

* **Create a simple text file that contains 2-3 lines at most describing yourself, your background, and interests. It must contain at least 1 emoji and 4-5 international characters (non-ASCII)**. Make sure to properly encode the file so that I can open it in `UTF-8` to read. I must be able to read all characters properly. Attach it to your submission.

Perform the same word count analysis using the reviews received from Amazon to answer your marketing manager's question. They are stored in two files, `poor_amazon_toy_reviews.txt` and `good_amazon_toy_reviews.txt`. **Provide a few sentences with your findings and business recommendations.** Make any assumptions you'd like; this is a fictitious company after all. I just want you to get into the habit of "finishing" your analysis: to avoid delivering technical numbers to a non-technical manager.

**Submit everything as a new notebook, and send the HW as an attachment in a Slack direct message to me (Yu Chen) and the TA.**

**NOTE**: Name the notebook `lastname_firstname_HW1.ipynb`.

# EDA and Data Preprocessing

In [1]:
from collections import Counter
import re
import pandas as pd

In [2]:
poor_reviews=open('poor_amazon_toy_reviews.txt','r',encoding='latin1')
good_reviews=open('good_amazon_toy_reviews.txt','r',encoding='latin1')

In [3]:
poor_lines=poor_reviews.readlines()
good_lines=good_reviews.readlines()

In [4]:
poor_lines

["Do not buy these! They break very fast I spun then for 15 minutes and the end flew off don't waste your money. They are made from cheap plastic and have cracks in them. Buy the poi balls they work a lot better if you only have limited funds.\n",
 "Showed up not how it's shown . Was someone's old toy. with paint on it.\n",
 'You need expansion packs 3-5 if you want access to the player aids for the Factions expansion. The base game of Alien Frontiers just plays so much smoother than adding Factions with the expansion packs. All this will do is pigeonhole you into a certain path to victory.\n',
 '"This was to be a gift for my husband for our new pool. Did not receive the color I ordered but most of all after only one month of use (not continuously) the mesh pulled away from the material and the inflatable side. Completely shredded and no longer of use. It was stored properly and was not kept outside or in the pool. Poorly made, better off going to W**-M*** and getting something on clea

In [5]:
good_lines

['Excellent!!!\n',
 '"Great quality wooden track (better than some others we have tried). Perfect match to the various vintages of Thomas track that we already have. There is enough track here to have fun and get creative incorporating your key pieces with track splits, loops and bends."\n',
 'my daughter loved it and i liked the price and it came to me rather than shopping with a ton of people around me. Amazon is the Best way to shop!\n',
 'Great item. Pictures pop thru and add detail as &#34;painted.&#34;  Pictures dry and it can be repainted.\n',
 'I was pleased with the product.\n',
 'Children like it\n',
 '"Really liked these. They were a little larger than I thought, but still fun."\n',
 '"Nice huge balloon! Had my local grocery store fill it up for a very small fee, it was totally worth it!"\n',
 'Great deal\n',
 'awesome ! Thanks!\n',
 'I got this item for me and my son to play around with. The closest relevance I have to items like these is while in the army I was trained in 

In [6]:
print(poor_lines[0])  # show only the first review

Do not buy these! They break very fast I spun then for 15 minutes and the end flew off don't waste your money. They are made from cheap plastic and have cracks in them. Buy the poi balls they work a lot better if you only have limited funds.



In [7]:
words_poor = Counter()
for line in poor_lines:
    for word in line.split(" "):
        words_poor[word] += 1
words_poor

Counter({'Do': 230,
         'not': 5459,
         'buy': 832,
         'these!': 7,
         'They': 626,
         'break': 121,
         'very': 1825,
         'fast': 45,
         'I': 9629,
         'spun': 6,
         'then': 556,
         'for': 4777,
         '15': 75,
         'minutes': 294,
         'and': 10909,
         'the': 18992,
         'end': 114,
         'flew': 46,
         'off': 816,
         "don't": 781,
         'waste': 668,
         'your': 852,
         'money.': 224,
         'are': 2400,
         'made': 628,
         'from': 1173,
         'cheap': 485,
         'plastic': 479,
         'have': 2214,
         'cracks': 9,
         'in': 4129,
         'them.': 218,
         'Buy': 33,
         'poi': 1,
         'balls': 115,
         'they': 1776,
         'work': 658,
         'a': 9203,
         'lot': 147,
         'better': 294,
         'if': 870,
         'you': 1973,
         'only': 1161,
         'limited': 10,
         'funds.\n': 1,
        

In [8]:
words_good = Counter()
for line in good_lines:
    for word in line.split(" "):
        words_good[word] += 1
words_good

Counter({'Excellent!!!\n': 10,
         '"Great': 1381,
         'quality': 3740,
         'wooden': 255,
         'track': 369,
         '(better': 3,
         'than': 5549,
         'some': 3889,
         'others': 470,
         'we': 6175,
         'have': 14375,
         'tried).': 2,
         'Perfect': 1674,
         'match': 231,
         'to': 63339,
         'the': 96606,
         'various': 226,
         'vintages': 1,
         'of': 36593,
         'Thomas': 153,
         'that': 18143,
         'already': 764,
         'have.': 148,
         'There': 1241,
         'is': 42321,
         'enough': 2094,
         'here': 650,
         'fun': 6507,
         'and': 86557,
         'get': 6040,
         'creative': 232,
         'incorporating': 5,
         'your': 4349,
         'key': 140,
         'pieces': 1925,
         'with': 28829,
         'splits,': 3,
         'loops': 31,
         'bends."\n': 1,
         'my': 28718,
         'daughter': 5522,
         'loved': 7196

In [9]:
len(good_lines)

102217

In [10]:
len(poor_lines)

12700

In [11]:
# Build one (word, frequency) row per Counter entry.
good_count_df=pd.DataFrame(list(words_good.items()),columns=["word","frequency"])
poor_count_df=pd.DataFrame(list(words_poor.items()),columns=["word","frequency"])

In [12]:
good_count_df=good_count_df.sort_values('frequency',ascending=False).reset_index(drop=True)
good_count_df

| | word | frequency |
|---|---|---|
| 0 | the | 96606 |
| 1 | and | 86557 |
| 2 | a | 66611 |
| 3 | to | 63339 |
| 4 | I | 48358 |
| ... | ... | ... |
| 125869 | fees | 1 |
| 125870 | kick-ass | 1 |
| 125871 | `Joe.<br` | 1 |
| 125872 | `League&#34;,` | 1 |


In [13]:
poor_count_df=poor_count_df.sort_values('frequency',ascending=False).reset_index(drop=True)
poor_count_df

| | word | frequency |
|---|---|---|
| 0 | the | 18992 |
| 1 | and | 10909 |
| 2 | I | 9629 |
| 3 | to | 9589 |
| 4 | a | 9203 |
| ... | ... | ... |
| 37309 | `strong.\n` | 1 |
| 37310 | `10.00\n` | 1 |
| 37311 | color.....instead | 1 |
| 37312 | buckets | 1 |
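
The counts above treat tokens such as `money.`, `funds.\n` and `strong.\n` as distinct words because the text is split on single spaces only. A minimal sketch of a normalized count, assuming that lowercasing and dropping punctuation is acceptable for this analysis (the names `TOKEN_PATTERN` and `count_words` and the sample lines are hypothetical and not part of the notebook):

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of lowercase letters, digits and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")

def count_words(lines):
    """Count lowercased tokens so 'Money.' and 'money' share one entry."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_PATTERN.findall(line.lower()))
    return counts

# Made-up sample lines, used only to illustrate the normalization.
sample_lines = ["Do not buy these! Waste of money.\n", "do NOT waste your money\n"]
counts = count_words(sample_lines)
print(counts["money"], counts["not"])
```

Applied to the review files, this would merge many of the single-occurrence entries at the bottom of the tables, so the resulting frequencies would differ from the ones shown here.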


In [14]:
poor_df=pd.DataFrame(open('poor_amazon_toy_reviews.txt','r',encoding='latin1'),columns=['line'])
poor_df['line']=poor_df['line'].str.replace("\n","")
poor_df

| | line |
|---|---|
| 0 | Do not buy these! They break very fast I spun ... |
| 1 | Showed up not how it's shown . Was someone's o... |
| 2 | You need expansion packs 3-5 if you want acces... |
| 3 | "This was to be a gift for my husband for our ... |
| 4 | Received a pineapple rather than the advertise... |
| ... | ... |
| 12695 | It's a piece of junk...doesn't charge multiple... |
| 12696 | Really small |
| 12697 | It is contained in glass which is dangerous if... |
| 12698 | "Fake. Not original. Every time my 5 yr old ki... |


In [15]:
good_df=pd.DataFrame(open('good_amazon_toy_reviews.txt','r',encoding='latin1'),columns=['line'])
good_df['line']=good_df['line'].str.replace("\n","")
good_df

| | line |
|---|---|
| 0 | Excellent!!! |
| 1 | "Great quality wooden track (better than some ... |
| 2 | my daughter loved it and i liked the price and... |
| 3 | Great item. Pictures pop thru and add detail a... |
| 4 | I was pleased with the product. |
| ... | ... |
| 102212 | fun game |
| 102213 | "Nice kit,well priced" |
| 102214 | Does what it is supposed to do. |
| 102215 | Grandson loves playing with these police figur... |


# Bullet Point 1

* A manager in the marketing department wants to find out the most frequently used words in positive reviews (five stars) and negative reviews (one star) in order to determine what occasion the toys are purchased for (Christmas, birthdays, and anniversaries). He would like your opinion on **which gift occasions (Christmas, birthdays, or anniversaries) tend to have the most positive reviews** to focus marketing budget on those days.

In [16]:
good_df['Christmas']=good_df.apply(lambda row: re.search(r'\b(CHRISTMAS|Christmas|christmas|XMAS|Xmas|xmas|SANTA|Santa|santa)\b',row.line),axis = 1)
good_df['Birthday']=good_df.apply(lambda row: re.search(r'\b(BIRTHDAY|Birthday|birthday|BDAY|Bday|bday)\b',row.line),axis = 1)
good_df['Anniversary']=good_df.apply(lambda row: re.search(r'\b(ANNIVERSARY|Anniversary|anniversary)\b',row.line),axis = 1)

In [17]:
print('Number of Christmas reviews:',good_df['Christmas'].count())
print('Number of Birthday reviews:',good_df['Birthday'].count())
print('Number of Anniversary reviews:',good_df['Anniversary'].count())

Number of Christmas reviews: 1232
Number of Birthday reviews: 3916
Number of Anniversary reviews: 50


The word count analysis shows that positive reviews mention birthdays (3,916) far more often than Christmas (1,232) or anniversaries (50). Birthdays and anniversaries, however, fall on different dates for every customer and are spread across the year, which makes them hard to target with a single campaign. Christmas is the only one of the three tied to a fixed date, so it should be a top business priority for date-specific marketing even though it ranks second by count. Between the two year-round occasions, birthdays are mentioned far more often than anniversaries, so if the business had to split its marketing dollars between them, birthday advertisements and initiatives should come first.
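
The patterns in cell 16 list each capitalization by hand and match only singular forms, so "birthdays" or "anniversaries" would not be counted. A minimal sketch of a case-insensitive variant that also accepts plurals follows; the variants `x-mas` and `b-day`, the names `OCCASION_PATTERNS` and `count_mentions`, and the sample lines are assumptions added for illustration, and running it on the review files would give counts that differ from those reported above.

```python
import re

# Hypothetical patterns: re.IGNORECASE replaces the hand-written case variants,
# and the optional suffixes accept plural forms.
OCCASION_PATTERNS = {
    "Christmas": re.compile(r"\b(?:christmas(?:es)?|x-?mas|santa)\b", re.IGNORECASE),
    "Birthday": re.compile(r"\b(?:birthdays?|b-?days?)\b", re.IGNORECASE),
    "Anniversary": re.compile(r"\b(?:anniversary|anniversaries)\b", re.IGNORECASE),
}

def count_mentions(lines):
    """Return, per occasion, the number of lines with at least one match."""
    return {
        name: sum(1 for line in lines if pattern.search(line))
        for name, pattern in OCCASION_PATTERNS.items()
    }

# Made-up sample lines.
sample = [
    "Bought for two BIRTHDAYS this year",
    "A great christmas gift",
    "Our anniversaries are always fun",
    "Nice toy",
]
mentions = count_mentions(sample)
print(mentions)
```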

# Bullet Point 2

* There are malformed characters in the review text. For instance, notice the `&#34;` - these are examples of incorrectly decoded [HTML encodings](https://krypted.com/utilities/html-encoding-reference/).
```
"amazing quality first of all, these cards are amazing proxies (but don't try to use em in &#34;official duels&#34; unless a judge is okay with it, if you have the real thing to show) and look amazing in your binder!"
```
Please clean up all instances of these incorrect decodings.

In [18]:
good=good_df[['line']].copy()  # .copy() gives an independent DataFrame and avoids SettingWithCopyWarning
poor=poor_df[['line']].copy()

In [19]:
# Remove numeric HTML entities for printable ASCII (&#32; through &#126;) in one vectorized pass.
good['line']=good['line'].str.replace(r'&#(?:3[2-9]|[4-9]\d|1[01]\d|12[0-6]);','',regex=True)
# Replace <br>, <br/> and <br /> tags with a space so neighboring words stay separated.
good['line']=good['line'].str.replace(r'<br\s*/?>',' ',regex=True)


In [20]:
# Same cleanup for the poor reviews (newlines were already stripped when poor_df was built).
poor['line']=poor['line'].str.replace(r'&#(?:3[2-9]|[4-9]\d|1[01]\d|12[0-6]);','',regex=True)
poor['line']=poor['line'].str.replace(r'<br\s*/?>',' ',regex=True)

In [21]:
good

| | line |
|---|---|
| 0 | Excellent!!! |
| 1 | "Great quality wooden track (better than some ... |
| 2 | my daughter loved it and i liked the price and... |
| 3 | Great item. Pictures pop thru and add detail a... |
| 4 | I was pleased with the product. |
| ... | ... |
| 102212 | fun game |
| 102213 | "Nice kit,well priced" |
| 102214 | Does what it is supposed to do. |
| 102215 | Grandson loves playing with these police figur... |


In [22]:
poor

| | line |
|---|---|
| 0 | Do not buy these! They break very fast I spun ... |
| 1 | Showed up not how it's shown . Was someone's o... |
| 2 | You need expansion packs 3-5 if you want acces... |
| 3 | "This was to be a gift for my husband for our ... |
| 4 | Received a pineapple rather than the advertise... |
| ... | ... |
| 12695 | It's a piece of junk...doesn't charge multiple... |
| 12696 | Really small |
| 12697 | It is contained in glass which is dangerous if... |
| 12698 | "Fake. Not original. Every time my 5 yr old ki... |


The DataFrames above have had the numeric HTML entities (`&#32;` through `&#126;`) and the `<br>` tag fragments removed.
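
The cleanup above deletes the entities, so `&#34;painted.&#34;` becomes `painted.` without its quotation marks. An alternative sketch, assuming it is preferable to restore the original characters, uses the standard library's `html.unescape`, which also decodes named entities such as `&amp;`. The helper name `clean_review` and the sample string are hypothetical.

```python
import html
import re

def clean_review(text: str) -> str:
    """Decode HTML entities, replace <br> tags with a space, tidy whitespace."""
    decoded = html.unescape(text)
    without_breaks = re.sub(r"<br\s*/?>", " ", decoded, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", without_breaks).strip()

# Made-up sample modeled on the review text shown earlier.
sample = "Pictures pop thru and add detail as &#34;painted.&#34;<br />Pictures dry."
cleaned = clean_review(sample)
print(cleaned)
```

On a DataFrame this could be applied with `good['line'].apply(clean_review)`, at the cost of being slower than a vectorized `str.replace`.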

# Bullet Point 3 (Please note that bullet point 4 is just a general remark)

* One of your product managers suspects that **toys purchased for male recipients (husbands, sons, etc.)** tend to be much more likely to be reviewed poorly. She would like to see some data points confirming or rejecting her hypothesis. 

In [23]:
good['male']=good.apply(lambda row: re.search(r'\b(SON|SONS|son|sons|\
                                                         Son|Sons|BOY|BOYS|\
                                                         boy|boys|Boy|Boys|\
                                                         GRANDSON|GRANDSONS|\
                                                         grandson|grandsons|\
                                                         Grandson|Grandsons|\
                                                         GUY|guy|Guy|\
                                                         BOYFRIEND|boyfriend|Boyfriend|\
                                                         BF|bf|Bf|HUSBAND|husband|Husband|\
                                                         BROTHER|BROTHERS|brother|brothers|\
                                                         Brother|Brothers|\
                                                         GRANDPA|grandpa|Grandpa|\
                                                         GRANDPOP|grandpop|Grandpop|\
                                                         GRANDFATHER|grandfather|Grandfather|\
                                                         FATHER|father|Father|\
                                                         DAD|dad|Dad|\
                                                         DADDY|daddy|Daddy|\
                                                         UNCLE|UNCLES|\
                                                         uncle|uncles|\
                                                         Uncle|Uncles|\
                                                         GROOM|groom|Groom|\
                                                         MALE|MALES|male|males|\
                                                         Male|Males|\
                                                         MEN|men|Men|\
                                                         MAN|man|Man|\
                                                         NEPHEW|NEPHEWS|\
                                                         nephew|nephews|\
                                                         Nephew|Nephews)\b'\
                                                         ,row.line),axis = 1)


In [24]:
poor['male']=poor.apply(lambda row: re.search(r'\b(SON|SONS|son|sons|\
                                                         Son|Sons|BOY|BOYS|\
                                                         boy|boys|Boy|Boys|\
                                                         GRANDSON|GRANDSONS|\
                                                         grandson|grandsons|\
                                                         Grandson|Grandsons|\
                                                         GUY|guy|Guy|\
                                                         BOYFRIEND|boyfriend|Boyfriend|\
                                                         BF|bf|Bf|HUSBAND|husband|Husband|\
                                                         BROTHER|BROTHERS|brother|brothers|\
                                                         Brother|Brothers|\
                                                         GRANDPA|grandpa|Grandpa|\
                                                         GRANDPOP|grandpop|Grandpop|\
                                                         GRANDFATHER|grandfather|Grandfather|\
                                                         FATHER|father|Father|\
                                                         DAD|dad|Dad|\
                                                         DADDY|daddy|Daddy|\
                                                         UNCLE|UNCLES|\
                                                         uncle|uncles|\
                                                         Uncle|Uncles|\
                                                         GROOM|groom|Groom|\
                                                         MALE|MALES|male|males|\
                                                         Male|Males|\
                                                         MEN|men|Men|\
                                                         MAN|man|Man|\
                                                         NEPHEW|NEPHEWS|\
                                                         nephew|nephews|\
                                                         Nephew|Nephews)\b'\
                                                         ,row.line),axis = 1)
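
One thing worth checking in the male and female patterns: inside a raw string, a backslash followed by a newline is not a line continuation. Both characters stay in the literal together with the indentation on the next line, so the first alternative on each continued line (for example `Son`, `boy`, `GRANDSON`) only matches when preceded by a newline and that run of spaces. The sketch below reproduces the effect on a shortened pattern and shows one possible alternative using `re.VERBOSE` and `re.IGNORECASE`. The names `broken` and `MALE_PATTERN` and the reduced word list are hypothetical, and the counts reported in this notebook were produced by the original patterns, so a corrected pattern would likely change them.

```python
import re

# Backslash + newline inside a raw string is NOT a line continuation:
# both characters, plus the indentation that follows, stay in the pattern.
broken = r'\b(son|sons|\
           boy|boys)\b'

print(re.search(broken, "a gift for my boy") is None)       # "boy" now needs a newline before it
print(re.search(broken, "a gift for my boys") is not None)  # "boys" is unaffected

# Hypothetical, shortened word list. re.VERBOSE ignores the layout whitespace,
# and re.IGNORECASE covers SON / Son / son with a single entry.
MALE_PATTERN = re.compile(
    r"""
    \b(?:
        sons? | boys? | grandsons? | nephews? | brothers? | uncles?
      | husband | boyfriend | father | dad | daddy | grandpa
    )\b
    """,
    re.IGNORECASE | re.VERBOSE,
)

print(bool(MALE_PATTERN.search("Bought for my boy")))
print(bool(MALE_PATTERN.search("My GRANDSON loves it")))
```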

In [25]:
good['female']=good.apply(lambda row: re.search(r'\b(DAUGHTER|DAUGHTERS|daughter|daughters|\
                                                         Daughter|Daughters|GIRL|GIRLS|\
                                                         girl|girls|Girl|Girls|\
                                                         GRANDDAUGHTER|GRANDDAUGHTERS|\
                                                         granddaughter|granddaughters|\
                                                         Granddaughter|Granddaughters|\
                                                         GIRLFRIEND|girlfriend|Girlfriend|\
                                                         GF|gf|Gf|WIFE|wife|Wife|\
                                                         SISTER|SISTERS|sister|sisters|\
                                                         Sister|Sisters|\
                                                         GRANDMA|grandma|Grandma|\
                                                         GRANDMOTHER|grandmother|Grandmother|\
                                                         MOTHER|mother|Mother|\
                                                         MOM|mom|Mom|\
                                                         MOMMY|mommy|Mommy|\
                                                         AUNT|AUNTS|\
                                                         aunt|aunts|\
                                                         Aunt|Aunts|\
                                                         AUNTIE|AUNTIES|\
                                                         auntie|aunties|\
                                                         Aunties|Aunties|\
                                                         BRIDE|bride|Bride|\
                                                         FEMALE|FEMALES|female|females|\
                                                         Female|Females|\
                                                         WOMEN|women|Women|\
                                                         WOMAN|woman|Woman|\
                                                         NIECE|NIECES|\
                                                         niece|nieces|\
                                                         Niece|Nieces)\b'\
                                                         ,row.line),axis = 1)


In [26]:
poor['female']=poor.apply(lambda row: re.search(r'\b(DAUGHTER|DAUGHTERS|daughter|daughters|\
                                                         Daughter|Daughters|GIRL|GIRLS|\
                                                         girl|girls|Girl|Girls|\
                                                         GRANDDAUGHTER|GRANDDAUGHTERS|\
                                                         granddaughter|granddaughters|\
                                                         Granddaughter|Granddaughters|\
                                                         GIRLFRIEND|girlfriend|Girlfriend|\
                                                         GF|gf|Gf|WIFE|wife|Wife|\
                                                         SISTER|SISTERS|sister|sisters|\
                                                         Sister|Sisters|\
                                                         GRANDMA|grandma|Grandma|\
                                                         GRANDMOTHER|grandmother|Grandmother|\
                                                         MOTHER|mother|Mother|\
                                                         MOM|mom|Mom|\
                                                         MOMMY|mommy|Mommy|\
                                                         AUNT|AUNTS|\
                                                         aunt|aunts|\
                                                         Aunt|Aunts|\
                                                         AUNTIE|AUNTIES|\
                                                         auntie|aunties|\
                                                         Aunties|Aunties|\
                                                         BRIDE|bride|Bride|\
                                                         FEMALE|FEMALES|female|females|\
                                                         Female|Females|\
                                                         WOMEN|women|Women|\
                                                         WOMAN|woman|Woman|\
                                                         NIECE|NIECES|\
                                                         niece|nieces|\
                                                         Niece|Nieces)\b'\
                                                         ,row.line),axis = 1)

In [27]:
print('Number of male good reviews:',good['male'].count())
print('Number of male poor reviews:',poor['male'].count())
print('Number of female good reviews:',good['female'].count())
print('Number of female poor reviews:',poor['female'].count())

Number of male good reviews: 10181
Number of male poor reviews: 805
Number of female good reviews: 9528
Number of female poor reviews: 584


In [28]:
amt_poor_f=poor['female'].count()
amt_good_f=good['female'].count()
amt_poor_m=poor['male'].count()
amt_good_m=good['male'].count()
p_m=amt_poor_m/(amt_poor_m+amt_good_m)
p_f=amt_poor_f/(amt_poor_f+amt_good_f)
print('Proportion of poor reviews for male recipients:',round(p_m,2))
print('Proportion of poor reviews for female recipients:',round(p_f,2))

Proportion of poor reviews for male recipients: 0.07
Proportion of poor reviews for female recipients: 0.06


The analysis shows that reviews mentioning male recipients include more poor reviews (805) than reviews mentioning female recipients (584). However, male recipients also appear in more good reviews (10,181 versus 9,528), so the raw counts are not directly comparable. As a share of each group's own reviews, poor reviews make up about 7% for male recipients and about 6% for female recipients. These proportions give better insight into how poor reviews relate to the recipient's gender than the raw numbers do, and a gap of roughly one percentage point offers only limited support for the hypothesis that toys bought for male recipients are much more likely to be reviewed poorly.
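
To put these proportions in context, they can be compared with the share of poor reviews in the whole dataset. The sketch below reuses only counts printed earlier in this notebook and assumes that each line of the two files is one review; the helper `poor_share` and the dictionary `shares` are hypothetical names.

```python
# Counts reported in the cells above (assumption: one review per line).
total_good, total_poor = 102217, 12700
male_good, male_poor = 10181, 805
female_good, female_poor = 9528, 584

def poor_share(poor, good):
    """Share of poor reviews among all reviews in a group."""
    return poor / (poor + good)

shares = {
    "all reviews": poor_share(total_poor, total_good),
    "male mentions": poor_share(male_poor, male_good),
    "female mentions": poor_share(female_poor, female_good),
}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

With these counts, poor reviews are about 11% of all reviews, compared with roughly 7% of reviews mentioning a male recipient and 6% of those mentioning a female recipient, so both groups sit below the overall rate.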

# Bullet Point 5

* Explain some of the **pitfalls/limitations** of using only a word count analysis to make these inferences. What additional research/steps would you need to do to verify your conclusions?

A pure word count analysis can be misleading because it only picks out keywords and ignores the context of the full review text: a review that mentions "son" does not necessarily describe a gift for a son. There could also be relevant words missing from our keyword lists that would materially change the results and insights. To verify our conclusions, an additional step would be to join each review to its product and order information, which would identify when the order took place and what type of product it is. Product type would be useful for products with a clear gender profile, and the purchase date could help validate the seasonal component (Christmas, for example).
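
The context problem can be illustrated with a small sketch. The reviews below are made up for this example and `SON_PATTERN` is a hypothetical name; the point is only that a keyword match cannot tell the recipient from the buyer.

```python
import re

SON_PATTERN = re.compile(r"\bsons?\b", re.IGNORECASE)

# Made-up reviews: both mention "son", but only the first describes
# a toy bought for a son.
hypothetical_reviews = [
    "My son loves this train set.",
    "My son bought this for his little sister.",
    "A fun puzzle for the whole family.",
]

# One flag per review: True when the keyword appears anywhere in the text.
flags = [bool(SON_PATTERN.search(review)) for review in hypothetical_reviews]
for review, flagged in zip(hypothetical_reviews, flags):
    print(flagged, "-", review)
```

Both of the first two reviews are flagged, so a count based on the keyword alone would attribute the second review to a male recipient as well.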