## Content in Amazon review data


The dataset `reviews.txt` contains [Amazon reviews](http://jmcauley.ucsd.edu/data/amazon/) for ~200k phones and phone accessories. This dataset has been "cleaned" for you. The goal of this section is to create a function that takes in the review dataset and a single review and returns the word that "best summarizes the review" using TF-IDF. This is done in two steps:

1. A function `tfidf_data(review, reviews)` that takes a review as well as the review data and returns a dataframe:
    - indexed by the words in `review`,
    - with columns given by (a) the number of times each word is found in the review (`cnt`), (b) the term frequency for each word (`tf`), (c) the inverse document frequency for each word (`idf`), and (d) the TF-IDF for each word (`tfidf`).
    
2. A function `relevant_word(tfidf_data)` that takes in a dataframe as above and returns the word that "best summarizes the review" described by `tfidf_data`, i.e. the word with the highest TF-IDF.
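
As a rough illustration of the four columns requested above, the sketch below computes `cnt`, `tf`, `idf`, and `tfidf` by hand on a tiny made-up corpus. The names `toy_reviews` and `toy_review` are hypothetical, and the definitions (tf = count / number of words in the review, idf = log of total reviews over reviews containing the word) are assumed to match the ones used in the code later in this section.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical toy corpus (not the Amazon data).
toy_reviews = pd.Series([
    'the case is the best',
    'the phone is fine',
    'the price is fine',
    'the best charger broke',
])
toy_review = toy_reviews.iloc[0]

words = toy_review.split()
cnt = pd.Series(words).value_counts()   # occurrences of each word in the review
tf = cnt / len(words)                   # term frequency within the review

# Document frequency: number of reviews containing the word as a whole word.
doc_freq = pd.Series({
    w: toy_reviews.str.contains(r'\b%s\b' % re.escape(w)).sum()
    for w in cnt.index
})
idf = np.log(len(toy_reviews) / doc_freq)

toy_tfidf = pd.DataFrame({'cnt': cnt, 'tf': tf, 'idf': idf, 'tfidf': tf * idf})
best_word = toy_tfidf['tfidf'].idxmax()
```

Here `the` is the most frequent word in the review, but it appears in every review, so its idf (and hence its TF-IDF) is zero. The word `case` appears in no other review and ends up with the highest score.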



In [72]:
fp = os.path.join('data', 'reviews.txt')
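# `squeeze=True` was removed from `read_csv` in pandas 2.0; squeeze the single column explicitly.
reviews = pd.read_csv(fp, header=None).squeeze('columns')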
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
review

'this is a great new case design that i have not seen before it has a slim silicone skin that really locks in the phone to cover and protect your phone from spills and such and also a hard polycarbonate outside shell cover to guard it against damage  this case also comes with different interchangeable skins and covers to create multiple color combinations  this is a different kind of case than the usual chunk of plastic  it is innovative and suits the iphone 5 perfectly'

In [75]:
cnt = pd.Series(review.split()).value_counts()
cnt.head()

and     5
a       4
this    3
is      3
to      3
dtype: int64

In [76]:
'\\b%s\\b' % 'and'

'\\band\\b'
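
The `\b` anchors matter because a bare substring search would also count words that merely contain the target. A small sketch on a made-up sentence (the sentence is not from the dataset):

```python
import re

sentence = 'this brand and that band and more'   # made-up example text
pat = '\\b%s\\b' % 'and'

bare_matches = re.findall('and', sentence)       # also hits 'brand' and 'band'
whole_word_matches = re.findall(pat, sentence)   # only the standalone word 'and'
```

With the anchors, only the two standalone occurrences of `and` are counted; without them, `brand` and `band` would inflate the count to four.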

In [359]:
df = pd.DataFrame([],index = cnt.index)
for word in cnt.index:
    pat = '\\b%s\\b' % word
    tf = len(re.findall(pat, review)) / cnt.sum()   
    idf = np.log(len(reviews) / reviews.str.contains(pat).sum())
    df.loc[word,'idf'] = idf
    df.loc[word,'tf'] = tf
    df.loc[word,'tfidf'] = tf * idf

In [360]:
df['tfidf'].idxmax()

'chunk'

In [361]:
df['cnt'] = cnt
df = df[['cnt', 'tf', 'idf', 'tfidf']]
df.head()

| | cnt | tf | idf | tfidf |
|---|---|---|---|---|
| and | 5 | 0.058824 | 0.248188 | 0.014599 |
| a | 4 | 0.047059 | 0.360392 | 0.01696 |
| this | 3 | 0.035294 | 0.441295 | 0.015575 |
| it | 3 | 0.035294 | 0.247858 | 0.008748 |
| to | 3 | 0.035294 | 0.382649 | 0.013505 |


In [77]:
def tfidf_data(review, reviews):
    """
    :Example:
    >>> fp = os.path.join('data', 'reviews.txt')
    >>> reviews = pd.read_csv(fp, header=None).squeeze('columns')
    >>> review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
    >>> out = tfidf_data(review, reviews)
    >>> out['cnt'].sum()
    85
    >>> 'before' in out.index
    True
    """
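    # Count how often each distinct word occurs in the review.
    cnt = pd.Series(review.split()).value_counts()
    df = pd.DataFrame([], index=cnt.index)
    for word in cnt.index:
        # Match the word only as a whole word; escape it in case it
        # contains regex metacharacters.
        pat = r'\b%s\b' % re.escape(word)
        # Term frequency: share of the review's words that are `word`.
        tf = len(re.findall(pat, review)) / cnt.sum()
        # Inverse document frequency: log of (number of reviews /
        # number of reviews containing `word`).
        idf = np.log(len(reviews) / reviews.str.contains(pat).sum())
        df.loc[word, 'idf'] = idf
        df.loc[word, 'tf'] = tf
        df.loc[word, 'tfidf'] = tf * idf

    df['cnt'] = cnt
    df = df[['cnt', 'tf', 'idf', 'tfidf']]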
    
    return df

In [78]:
fp = os.path.join('data', 'reviews.txt')
reviews = pd.read_csv(fp, header=None).squeeze('columns')
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
out = lab.tfidf_data(review, reviews)
out['cnt'].sum()

85

In [80]:
out.head()

| | cnt | tf | idf | tfidf |
|---|---|---|---|---|
| and | 5 | 0.058824 | 0.248188 | 0.014599 |
| a | 4 | 0.047059 | 0.360392 | 0.01696 |
| this | 3 | 0.035294 | 0.441295 | 0.015575 |
| is | 3 | 0.035294 | 0.516393 | 0.018226 |
| to | 3 | 0.035294 | 0.382649 | 0.013505 |


In [316]:
def relevant_word(out):
    """
    :Example:
    >>> fp = os.path.join('data', 'reviews.txt')
    >>> reviews = pd.read_csv(fp, header=None).squeeze('columns')
    >>> review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
    >>> out = tfidf_data(review, reviews)
    >>> relevant_word(out) in out.index
    True
    """
    return out['tfidf'].idxmax()

In [369]:
lab.relevant_word(out)

'chunk'

### Tweet Analysis: Internet Research Agency

The dataset `data/ira.csv` contains tweets tagged by Twitter as likely being posted by the *Internet Research Agency* (the tweet factory facing allegations of attempting to influence US political elections).

- We will look at the hashtags present in the text and trends in their makeup.
- We will prepare this dataset for modeling by creating features out of the text fields.


* A function `hashtag_list` that takes in a column of tweet text and returns a column containing the list of hashtags present in each tweet. If a tweet doesn't contain a hashtag, the corresponding entry should be an empty list.

* A function `most_common_hashtag` that takes in a column of hashtag lists (the output above) and returns a column containing a single hashtag for each tweet:
    - If the text has no hashtags, the entry should be `NaN`,
    - If the text has one distinct hashtag, the entry should contain that hashtag,
    - If the text has more than one hashtag, the entry should be the most common hashtag (among all hashtags in the column). If there is a tie for most common, any of the most common can be returned.
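
To make the three cases concrete, here is a minimal sketch on a hypothetical column of hashtag lists. The values in `toy_lists` are made up, and the helper `pick` is one possible reading of the rules rather than the reference solution.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical output of `hashtag_list` for four tweets.
toy_lists = pd.Series([[], ['news'], ['tech', 'news'], ['tech', 'sports', 'news']])

# Count every hashtag occurrence across the whole column.
column_counts = Counter(tag for tags in toy_lists for tag in tags)

def pick(tags):
    # No hashtags -> NaN; otherwise the tag that is most frequent column-wide.
    if not tags:
        return np.nan
    return max(tags, key=column_counts.__getitem__)

toy_result = toy_lists.apply(pick)
```

In this toy column `news` occurs three times, `tech` twice, and `sports` once, so every tweet that contains `news` is assigned `news`, and the tweet with no hashtags is assigned `NaN`.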
       

In [18]:
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])
ira.head()

| | id | name | date | text |
|---|---|---|---|---|
| 0 | 3906258 | ea85ac8be1e8ab479064ca4c0fe3ac6587f76b1ef97452... | 2016-11-16 09:04 | The Best Exercise To Lose Belly Fat In 2 weeks... |
| 1 | 1051443 | 8e58ab0f46d273103d9e71aa92cdaffb6e330ec7d15ae5... | 2016-12-24 04:31 | RT @Philanthropy: Dozens of ‘hate groups’ have... |
| 2 | 2823399 | Room Of Rumor | 2016-08-18 20:26 | Artificial intelligence can find, map poverty,... |
| 3 | 272878 | San Francisco Daily | 2016-03-18 19:28 | Uber balks at rules proposed by world’s busies... |
| 4 | 7697802 | 41bb9ae5991f53996752a0ab8dd36b543821abca8d5aed... | 2016-07-30 15:44 | RT @dirtroaddiva1: #IHatePokemonGoBecause he ... |


In [30]:
def hashtag_list(tweet_text):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = pd.DataFrame(testdata, columns=['text'])
    >>> out = hashtag_list(test['text'])
    >>> (out.iloc[0] == ['NLP', 'NLP1', 'NLP1'])
    True
    """

    col = tweet_text.apply(lambda x: re.findall(r'(?<=#)[^\s]*', x))
    return col


In [40]:
def most_common_hashtag(tweet_lists):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = hashtag_list(pd.DataFrame(testdata, columns=['text'])['text'])
    >>> most_common_hashtag(test).iloc[0]
    'NLP1'
    """

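    hashtag_lists = tweet_lists
    # Concatenate all lists and count each hashtag across the whole column.
    hashtags = pd.Series(hashtag_lists.sum()).value_counts()
    # For each tweet, map its hashtags to their column-wide counts.
    hashtags_dic = hashtag_lists.apply(lambda x: {tag: hashtags[tag] for tag in x})
    # Keep the hashtag with the highest count (NaN if the tweet has none).
    common = hashtags_dic.apply(lambda x: pd.Series(x).idxmax() if len(x) != 0 else np.nan)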
    
    return common
    

In [84]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = hashtag_list(pd.DataFrame(testdata, columns=['text'])['text'])
most_common_hashtag(test).iloc[0]

'NLP1'
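
Summing a column of lists concatenates them one at a time, which may become slow on the full 90,000-row dataset. As a hedged alternative, the sketch below uses `Series.explode` instead. The function name `most_common_hashtag_explode` and the `toy` data are hypothetical, and this is not the lab's reference solution.

```python
import pandas as pd

def most_common_hashtag_explode(hashtag_lists):
    # One row per (tweet, hashtag); tweets without hashtags become NaN and are dropped.
    exploded = hashtag_lists.explode().dropna()
    counts = exploded.value_counts()          # column-wide count of each hashtag

    ranked = exploded.to_frame('tag')
    ranked['n'] = ranked['tag'].map(counts)   # attach each hashtag's count
    # Stable sort keeps the original order among equally common hashtags.
    ranked = ranked.sort_values('n', ascending=False, kind='stable')

    # The first row per tweet is now its most common hashtag; tweets with
    # no hashtags are restored as NaN by the reindex.
    best = ranked.groupby(level=0)['tag'].first()
    return best.reindex(hashtag_lists.index)

toy = pd.Series([['NLP', 'NLP1', 'NLP1'], [], ['NLP1']])
toy_out = most_common_hashtag_explode(toy)
```

On this toy input, the first and third tweets are assigned `NLP1`, and the tweet with an empty list is assigned `NaN`.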


Next, a function `create_features` that takes in the `ira` data and returns a dataframe with the same index as `ira` (i.e. the rows correspond to the same tweets) and the following columns:
* `num_hashtags` gives the number of hashtags present in a tweet,
* `mc_hashtags` gives the most common hashtag associated with a tweet (as given by the problem above),
* `num_tags` gives the number of tags a given tweet has (look for the presence of `@`),
* `num_links` gives the number of hyperlinks present in a given tweet
    - (a hyperlink is a string that starts with `http(s)://` and runs up to the next whitespace character),
* A boolean column `is_retweet` that describes if the given tweet is a retweet (i.e. `RT`),
* A 'clean' text field `text` that contains the tweet text with:
    - The non-alphanumeric characters removed (except spaces),
    - All words should be separated by exactly one space,
    - The characters all lowercase,
    - All the meta-information above (Retweet info, tags, hyperlinks, hashtags) removed.
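
The requirements above can be sketched on a single tweet string before applying them to a whole column. The patterns below are one possible choice, not necessarily the ones used in the solution that follows, and the variable names are illustrative.

```python
import re

tweet = 'RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1'

# Counts of each kind of meta-information.
num_hashtags = len(re.findall(r'#\S+', tweet))
num_tags = len(re.findall(r'@\S+', tweet))
num_links = len(re.findall(r'https?://\S+', tweet))
is_retweet = tweet.startswith('RT')

# Clean text: strip the retweet marker, then links, tags, and hashtags,
# then anything that is not a letter, digit, or space.
cleaned = re.sub(r'^RT\s+', '', tweet)
cleaned = re.sub(r'https?://\S+|@\S+|#\S+', ' ', cleaned)
cleaned = re.sub(r'[^A-Za-z0-9 ]', ' ', cleaned)
# Collapse runs of whitespace to a single space and lowercase.
cleaned = re.sub(r'\s+', ' ', cleaned).strip().lower()
```

For this tweet the sketch yields three hashtags, one tag, one link, a retweet flag of `True`, and the cleaned text `text cleaning is cool`.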


In [14]:
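# Exploratory version of the cleaning steps (lowercasing is added in `create_features`).
text = ira['text'].apply(lambda x: re.sub(r'^RT\s', '', x))         # leading retweet marker
text = text.apply(lambda x: re.sub(r'(?<=@)[^\s:]*', '', x))        # tag names after `@`
text = text.apply(lambda x: re.sub(r'(?<=#)\S*', '', x))            # hashtags after `#`
text = text.apply(lambda x: re.sub(r'http\S*', '', x))              # hyperlinks
text = text.apply(lambda x: re.sub(r'\W+', ' ', x))                 # non-word runs -> one space
text = text.apply(lambda x: x.strip())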
text

0           The Best Exercise To Lose Belly Fat In 2 weeks
1        Dozens of hate groups have charity status Chro...
2        Artificial intelligence can find map poverty r...
3        Uber balks at rules proposed by world s busies...
4        he didn t let me do that for a Klondike bar Sc...
                               ...                        
89995             Trump Kasich shouldn t be allowed to run
89996               The last step at the top of the stairs
89997    When someone said the first link from in my ra...
89998    I speak the Word of God therefore because the ...
89999                         10 Things to Know for Monday
Name: text, Length: 90000, dtype: object

In [17]:
df = pd.DataFrame({'num_hashtags': num_hashtags,
                   'mc_hashtags': mc_hashtags,
                   'num_tags': num_tags,
                   'num_links': num_links,
                   'is_retweet': is_retweet,
                   'text': text                   
                  })
df

| | num_hashtags | mc_hashtags | num_tags | num_links | is_retweet | text |
|---|---|---|---|---|---|---|
| 0 | 4 | CatTV | 0 | 2 | False | The Best Exercise To Lose Belly Fat In 2 weeks |
| 1 | 0 | | 1 | 1 | True | Dozens of hate groups have charity status Chro... |
| 2 | 1 | tech | 0 | 0 | False | Artificial intelligence can find map poverty r... |
| 3 | 1 | news | 0 | 0 | False | Uber balks at rules proposed by world s busies... |
| 4 | 2 | IHatePokemonGoBecause | 1 | 1 | True | he didn t let me do that for a Klondike bar Sc... |
| ... | ... | ... | ... | ... | ... | ... |
| 89995 | 1 | politics | 0 | 1 | False | Trump Kasich shouldn t be allowed to run |
| 89996 | 1 | ThingsYouCantIgnore | 1 | 0 | True | The last step at the top of the stairs |
| 89997 | 0 | | 2 | 1 | True | When someone said the first link from in my ra... |
| 89998 | 1 | rantfortoday | 1 | 0 | True | I speak the Word of God therefore because the ... |


In [68]:
def create_features(ira):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = pd.DataFrame(testdata, columns=['text'])
    >>> out = create_features(test)
    >>> anscols = ['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']
    >>> ansdata = [['text cleaning is cool', 3, 'NLP1', 1, 1, True]]
    >>> ans = pd.DataFrame(ansdata, columns=anscols)
    >>> (out == ans).all().all()
    True
    """
    # Hashtags: compute the lists once and reuse them for both features.
    hashtag_lists = hashtag_list(ira['text'])
    num_hashtags = hashtag_lists.apply(len)
    mc_hashtags = most_common_hashtag(hashtag_lists)

    # Tags: an `@` followed by any run of characters other than whitespace or `:`.
    num_tags = ira['text'].apply(lambda x: len(re.findall(r'(?<=@)[^\s:]*', x)))

    # Links: any run of non-whitespace characters starting with `http`.
    num_links = ira['text'].apply(lambda x: len(re.findall(r'http\S*', x)))

    # Retweets start with `RT`.
    is_retweet = ira['text'].str.startswith('RT')

    # Clean text: drop the retweet marker, tag names, hashtags, and links, then
    # replace every run of non-word characters with a single space.
    text = ira['text'].apply(lambda x: re.sub(r'^RT\s', '', x))
    text = text.apply(lambda x: re.sub(r'(?<=@)[^\s:]*', '', x))
    text = text.apply(lambda x: re.sub(r'(?<=#)\S*', '', x))
    text = text.apply(lambda x: re.sub(r'http\S*', '', x))
    text = text.apply(lambda x: re.sub(r'\W+', ' ', x))
    text = text.apply(lambda x: x.strip().lower())

    df = pd.DataFrame({'text': text,
                   'num_hashtags': num_hashtags,
                   'mc_hashtags': mc_hashtags,
                   'num_tags': num_tags,
                   'num_links': num_links,   
                   'is_retweet': is_retweet
                  })

    return df

In [70]:
lab.create_features(ira[:100])

| | text | num_hashtags | mc_hashtags | num_tags | num_links | is_retweet |
|---|---|---|---|---|---|---|
| 0 | the best exercise to lose belly fat in 2 weeks | 4 | Exercise | 0 | 2 | False |
| 1 | dozens of hate groups have charity status chro... | 0 | | 1 | 1 | True |
| 2 | artificial intelligence can find map poverty r... | 1 | tech | 0 | 0 | False |
| 3 | uber balks at rules proposed by world s busies... | 1 | news | 0 | 0 | False |
| 4 | he didn t let me do that for a klondike bar sc... | 2 | IHatePokemonGoBecause | 1 | 1 | True |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | malcolm revisited | 0 | | 0 | 1 | False |
| 96 | the frightening image that made me turn and run | 1 | IWouldPreferToForget | 1 | 1 | True |
| 97 | man sought for illegal sexual contact with minor | 0 | | 0 | 0 | False |
| 98 | so satisfying | 0 | | 0 | 1 | False |
