# DSC 80: Lab 07

### Due Date: Saturday, November 21 11:59PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the problem statements and provides code and markdown cells to display your answers. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab*.py` file that will be imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks serve to present the questions to you and give you a place to present your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab**.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab**.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab**.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab**.py` in the notebook.
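    - `autoreload` is necessary because Python caches each imported module (in `sys.modules`); subsequent imports of `lab**` return the cached module instead of re-reading the edited file. (The bytecode that is compiled into `__pycache__` on import is refreshed automatically when the source changes.)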
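
The caching behavior that `autoreload` works around can be seen in a minimal sketch. It assumes a throwaway module with the hypothetical name `demo_lab_module`, written to a temporary directory; the manual `importlib.reload` call stands in for what the extension does automatically.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway module (hypothetical name) in a temporary directory.
tmp_dir = tempfile.mkdtemp()
module_path = Path(tmp_dir) / "demo_lab_module.py"
module_path.write_text("VALUE = 1\n")
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

import demo_lab_module
first = demo_lab_module.VALUE

# Edit the file on disk, then import again: the cached module is reused.
module_path.write_text("VALUE = 22\n")
import demo_lab_module
after_reimport = demo_lab_module.VALUE

# An explicit reload re-executes the edited source.
importlib.reload(demo_lab_module)
after_reload = demo_lab_module.VALUE

print(first, after_reimport, after_reload)
```

The second `import` statement does not pick up the edit; only the reload does.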

In [3]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
import lab07 as lab

In [5]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import os
import time
import re

In [6]:
import requests
import json

# Practice with regular expressions (Regex)

**Question 1**

You start with some basic regular expression exercises to get some practice using them. You will find function stubs and related doctests in the starter code. 

**Exercise 1:** A string that has a `[` as the third character and `]` as the sixth character.

**Exercise 2:** Phone numbers that start with '(858)' and follow the format '(xxx) xxx-xxxx' (x represents a digit).

*Notice: There is a space between (xxx) and xxx-xxxx*

**Exercise 3:** A string whose length is between 6 and 10 and that contains only word characters, whitespace, and `?`. The string must have `?` as its last character.

**Exercise 4:** A string that begins with '\\$' and with another '\\$' within, where:
   - Characters between the two '\\$' can be anything (including nothing) except the letters 'a', 'b', 'c' (lower case).
   - Characters after the second '\\$' can only have any number of the letters 'a', 'b', 'c' (upper or lower case), with every 'a' before every 'b', and every 'b' before every 'c'.
       - E.g. 'AaBbbC' works, 'ACB' doesn't.

**Exercise 5:** A string that represents a valid Python file name including the extension. 

*Notice*: For simplicity, assume that the file name contains only letters, numbers and an underscore `_`.

**Exercise 6:** Find patterns of lowercase letters joined with an underscore.

**Exercise 7:** Find patterns that start with and end with a `_`.

**Exercise 8:**  Apple registration numbers and Apple hardware product serial numbers might have the number '0' (zero), but never the letter 'O'. Serial numbers don't have the number '1' (one) or the letter 'i'. Write a regular expression that checks whether the given serial number belongs to a genuine Apple product.

**Exercise 9:** Check if a given ID number is from Los Angeles (LAX), San Diego (SAN) or the state of New York (NY). ID numbers have the following format: `SC-NN-CCC-NNNN`.
   - SC represents state code in uppercase 
   - NN represents a number with 2 digits 
   - CCC represents a three letter city code in uppercase
   - NNNN represents a number with 4 digits

**Exercise 10:**  Given an input string, cast it to lower case, remove spaces/punctuation, and return a list of every 3-character substring following this logic:
   - The first character doesn't start with 'a' or 'A'
   - The last substring (and only the last substring) can be shorter than 3 characters, depending on the length of the input string.
   - The substrings cannot overlap
   
Here's an example with one of the doctests:

`>>> match_10("Ab..DEF")`
`['def']`

1. convert it to a lowercase string resulting in "ab..def"
2. delete any 3-character sequence that starts with the letter 'a', so delete "ab." from the string, leaving us with ".def"
3. delete the punctuation resulting in "def"
4. finally, we get `["def"]`

(Only split in the last step; every earlier step just removes characters from the string.)
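
The starter code checks each pattern with `re.compile(pattern).search(string)`, which succeeds if the pattern matches *anywhere* in the string. The sketch below (with a hypothetical helper `is_match` and made-up patterns, not one of the exercises) illustrates why most of the exercises need the `^` and `$` anchors.

```python
import re

def is_match(pattern, string):
    # Same check the starter code performs (hypothetical helper name).
    return re.compile(pattern).search(string) is not None

unanchored = r'\d{3}-\d{4}'
anchored = r'^\d{3}-\d{4}$'

# Without anchors, a match buried inside a longer string still counts.
print(is_match(unanchored, 'call 456-7890 now'))  # True
# With anchors, the whole string has to fit the pattern.
print(is_match(anchored, 'call 456-7890 now'))    # False
print(is_match(anchored, '456-7890'))             # True
```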

In [7]:
match_3("qw?ertsd?")

NameError: name 'match_3' is not defined

In [10]:
def match_1(string):
    """
    >>> match_1("abcde]")
    False
    >>> match_1("ab[cde")
    False
    >>> match_1("a[cd]")
    False
    >>> match_1("ab[cd]")
    True
    >>> match_1("1ab[cd]")
    False
    >>> match_1("ab[cd]ef")
    True
    >>> match_1("1b[#d] _")
    True
    """
    #Your Code Here
    pattern = r'^..\[..\]'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None



def match_2(string):
    """
    Phone numbers that start with '(858)' and
    follow the format '(xxx) xxx-xxxx' (x represents a digit)
    Notice: There is a space between (xxx) and xxx-xxxx

    >>> match_2("(123) 456-7890")
    False
    >>> match_2("858-456-7890")
    False
    >>> match_2("(858)45-7890")
    False
    >>> match_2("(858) 456-7890")
    True
    >>> match_2("(858)456-789")
    False
    >>> match_2("(858)456-7890")
    False
    >>> match_2("a(858) 456-7890")
    False
    >>> match_2("(858) 456-7890b")
    False
    """
    #Your Code Here
    pattern = r'^\(858\) \d{3}-\d{4}$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None




def match_3(string):
    """
    Find a pattern whose length is between 6 to 10
    and contains only word character, white space and ?.
    This string must have ? as its last character.

    >>> match_3("qwertsd?")
    True
    >>> match_3("qw?ertsd?")
    True
    >>> match_3("ab c?")
    False
    >>> match_3("ab   c ?")
    True
    >>> match_3(" asdfqwes ?")
    False
    >>> match_3(" adfqwes ?")
    True
    >>> match_3(" adf!qes ?")
    False
    >>> match_3(" adf!qe? ")
    False
    """
    #Your Code Here

    # \w covers digits and underscores as well as letters; \s covers all whitespace.
    pattern = r'^[\w\s?]{5,9}\?$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


def match_4(string):
    """
    A string that begins with '$' and with another '$' within, where:
        - Characters between the two '$' can be anything except the 
        letters 'a', 'b', 'c' (lower case).
        - Characters after the second '$' can only have any number 
        of the letters 'a', 'b', 'c' (upper or lower case), with every 
        'a' before every 'b', and every 'b' before every 'c'.
            - E.g. 'AaBbbC' works, 'ACB' doesn't.

    >>> match_4("$$AaaaaBbbbc")
    True
    >>> match_4("$!@#$aABc")
    True
    >>> match_4("$a$aABc")
    False

    >>> match_4("$iiuABc")
    False
    >>> match_4("123$Abc")
    False
    >>> match_4("$$Abc")
    True
    >>> match_4("$qw345t$AAAc")
    False
    >>> match_4("$s$Bca")
    False
    """
    #Your Code Here
    pattern = r'^\$[^a-c]*\$[aA]+[bB]+[cC]+$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


def match_5(string):
    """
    A string that represents a valid Python file name including the extension.
    *Notice*: For simplicity, assume that the file name contains only letters, numbers and an underscore `_`.

    >>> match_5("dsc80.py")
    True
    >>> match_5("dsc80py")
    False
    >>> match_5("dsc80..py")
    False
    >>> match_5("dsc80+.py")
    False
    """

    #Your Code Here
    # Anchor at the start too, so names like 'a+b.py' are rejected.
    pattern = r'^\w+\.py$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


def match_6(string):
    """
    Find patterns of lowercase letters joined with an underscore.
    >>> match_6("aab_cbb_bc")
    False
    >>> match_6("aab_cbbbc")
    True
    >>> match_6("aab_Abbbc")
    False
    >>> match_6("abcdef")
    False
    >>> match_6("ABCDEF_ABCD")
    False
    """

    #Your Code Here
    pattern = '^[a-z]+_[a-z]+$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None



def match_7(string):
    """
    Find patterns that start with and end with a _
    >>> match_7("_abc_")
    True
    >>> match_7("abd")
    False
    >>> match_7("bcd")
    False
    >>> match_7("_ncde")
    False
    """

    pattern = '^_.*_$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None




def match_8(string):
    """
    Apple registration numbers and Apple hardware product serial numbers
    might have the number "0" (zero), but never the letter "O".
    Serial numbers don't have the number "1" (one) or the letter "i".

    Write a line of regex expression that checks
    if the given Serial number belongs to a genuine Apple product.

    >>> match_8("ASJDKLFK10ASDO")
    False
    >>> match_8("ASJDKLFK0ASDo")
    True
    >>> match_8("JKLSDNM01IDKSL")
    False
    >>> match_8("ASDKJLdsi0SKLl")
    False
    >>> match_8("ASDJKL9380JKAL")
    True
    """

    pattern = '^[^OiI1]*$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None



def match_9(string):
    '''
    >>> match_9('NY-32-NYC-1232')
    True
    >>> match_9('ca-23-SAN-1231')
    False
    >>> match_9('MA-36-BOS-5465')
    False
    >>> match_9('CA-56-LAX-7895')
    True
    '''


    # Any city code is accepted for New York state; CA must be LAX or SAN.
    pattern = r'^(NY-\d{2}-[A-Z]{3}|CA-\d{2}-(LAX|SAN))-\d{4}$'

    #Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


def match_10(string):
    '''
    Given an input string, cast it to lower case, remove spaces/punctuation, 
    and return a list of every 3-character substring that satisfy the following:
        - The first character doesn't start with 'a' or 'A'
        - The last substring (and only the last substring) can be shorter than 
        3 characters, depending on the length of the input string.
    
    >>> match_10('ABCdef')
    ['def']
    >>> match_10(' DEFaabc !g ')
    ['def', 'cg']
    >>> match_10('Come ti chiami?')
    ['com', 'eti', 'chi']
    >>> match_10('and')
    []
    >>> match_10( "Ab..DEF")
    ['def']

    '''
    
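    # Lowercase and drop spaces, then cut into non-overlapping chunks of up to 3 characters.
    fixed = re.sub(r' ', '', string.lower())
    substrings = re.findall(r'\S{1,3}', fixed)
    new_string = ''
    for substring in substrings:
        # Keep only chunks that do not start with 'a'; strip punctuation from them.
        if not substring.startswith('a'):
            new_string += re.sub(r'[^\w]', '', substring)
    # Re-split the cleaned string into the final 3-character pieces.
    return re.findall(r'\S{1,3}', new_string)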

In [11]:
match_10("Ab..DEF")
    

['def']

In [9]:
x = 'isKKJ hfku'
x.lower()

'iskkj hfku'

## Regex groups: extracting personal information from messy data

**Question 2**

The file in `data/messy.txt` contains personal information from a fictional website that a user scraped from webserver logs. Within this dataset, there are four fields that interest you:
1. Email Addresses (assume they are alphanumeric user-names and domain-names),
2. [Social Security Numbers](https://en.wikipedia.org/wiki/Social_Security_number#Structure)
3. Bitcoin Addresses (alpha-numeric strings of long length)
4. Street Addresses

Create a function `extract_personal` that takes in a string like `open('data/messy.txt').read()` and returns a tuple of four separate lists containing values of the 4 pieces of information listed above (in the order given). Do **not** keep empty values.

*Hint*: There are multiple "delimiters" in use in the file; there are few enough of them that you can safely determine what they are.

*Note:* Since this data is messy/corrupted, your function will be allowed to miss ~5% of the records in each list. Good spot checking using certain useful substrings (e.g. `@` for emails) should help assure correctness! Your function will be tested on a sample of the file `messy.txt`.
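
As a minimal sketch of how regex groups help here: when a pattern contains one capture group, `re.findall` returns only the captured part, so a label such as `ssn:` can anchor the match without ending up in the result. The `sample` string below is a shortened fragment modeled on the first record of `messy.txt`; the two patterns are illustrative assumptions, not the required solution.

```python
import re

# Shortened fragment modeled on the first record of the file.
sample = ("dottewell0@gnu.org\toR1mOq,[{bitcoin:18A8rBU3wvbLTSxMjqrPNc9mvonpA4XMiv"
          "\tIP:192.232.9.210|ssn:380-09-9403}]")

# The parentheses mark the part of each match that findall returns.
ssns = re.findall(r'ssn:(\d{3}-\d{2}-\d{4})', sample)
bitcoins = re.findall(r'bitcoin:([a-zA-Z0-9]+)', sample)

print(ssns)      # ['380-09-9403']
print(bitcoins)  # ['18A8rBU3wvbLTSxMjqrPNc9mvonpA4XMiv']
```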

In [15]:
fp = os.path.join('data', 'messy.txt')
s = open(fp, encoding='utf8').read()

In [16]:
s[:1000]

'1\t4/12/2018\tLorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin risus. Praesent lectus.\n\nVestibulum quam sapien| varius ut, blandit non, interdum in, ante. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Duis faucibus accumsan odio. Curabitur convallis.|dottewell0@gnu.org\toR1mOq,!@#$%^&*(),[{bitcoin:18A8rBU3wvbLTSxMjqrPNc9mvonpA4XMiv\tIP:192.232.9.210\tccn:3563354617955160|ssn:380-09-9403}]|05-6609813,814 Monterey Court\n2\t12/18/2018\tSuspendisse potenti. In eleifend quam a odio. In hac habitasse platea dictumst.\n\nMaecenas ut massa quis augue luctus tincidunt. Nulla mollis molestie lorem. Quisque ut erat.,bassiter1@sphinn.com\tc5KvmarHX3o,test\u2060test\u202b,[{bitcoin:1EB7kYpnfJSqS7kUFpinsmPF3uiH9sfRf1,IP:20.73.13.197|ccn:3542723823957010\tssn:118-12-8276}#{bitcoin:1E5fev4boabWZmXvHGVkHcNJZ2tLnpM6Zv*IP:238.206.212.148\tccn:337941898369615,ssn:427-22-9352}#{bitcoin:1DqG3WcmGw74PjptjzcAmxGFuQdvWL7RCC,IP:171.241.15.98\tccn:3574

In [17]:
def extract_personal(s):
    """
    :Example:
    >>> fp = os.path.join('data', 'messy.test.txt')
    >>> s = open(fp, encoding='utf8').read()
    >>> emails, ssn, bitcoin, addresses = extract_personal(s)
    >>> emails[0] == 'test@test.com'
    True
    >>> ssn[0] == '423-00-9575'
    True
    >>> bitcoin[0] == '1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2'
    True
    >>> addresses[0] == '530 High Street'
    True
    """
    def remove_pre(strings, pre):
        # Drop the field label (e.g. 'ssn:') from each match.
        return [x.replace(pre, '') for x in strings]

    # Emails: alphanumeric user name and domain name, then a three-letter TLD.
    email_pattern = r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]{3}'
    emails = re.findall(email_pattern, s)

    # SSNs: keep only structurally valid ones (area number not 000, 666, or 9xx).
    ssn_pattern = r'ssn:\d{3}-\d{2}-\d{4}'
    ssn_fixed = remove_pre(re.findall(ssn_pattern, s), 'ssn:')
    ssn_final = [x for x in ssn_fixed if x[:3] not in ('000', '666') and x[0] != '9']

    # Bitcoin addresses: long alphanumeric strings after the 'bitcoin:' label.
    bits_pattern = r'bitcoin:[a-zA-Z0-9]{5,}'
    bits_fixed = remove_pre(re.findall(bits_pattern, s), 'bitcoin:')

    # Street addresses: a house number followed by two words.
    # The range must be [a-zA-Z]; [a-zA-z] also matches '[', ']', '^', '_' and others.
    adds_pattern = r'\d+ [a-zA-Z]+ [a-zA-Z]+'
    adds = re.findall(adds_pattern, s)
    return emails, ssn_final, bits_fixed, adds

In [18]:
fp = os.path.join('data', 'messy.txt')
s = open(fp, encoding='utf8').read()
emails, ssn, bitcoin, addresses= extract_personal(s)

In [19]:
ends = pd.Series(addresses)

In [20]:
ends =pd.Series([i[:3] for i in ssn])

In [21]:
np.sort(ends.unique())


array(['100', '101', '102', '103', '104', '105', '106', '107', '108',
       '109', '110', '111', '112', '114', '115', '116', '117', '118',
       '119', '120', '121', '123', '124', '125', '126', '127', '128',
       '129', '130', '131', '132', '133', '134', '135', '136', '137',
       '138', '139', '140', '142', '143', '144', '145', '146', '147',
       '148', '150', '151', '152', '153', '154', '155', '156', '157',
       '158', '159', '160', '162', '163', '164', '166', '167', '168',
       '169', '170', '171', '172', '173', '174', '175', '176', '177',
       '178', '180', '181', '182', '183', '184', '185', '186', '187',
       '188', '189', '190', '191', '192', '193', '194', '195', '196',
       '197', '198', '199', '200', '201', '202', '203', '204', '205',
       '206', '207', '208', '209', '210', '211', '212', '213', '214',
       '215', '216', '217', '218', '219', '220', '221', '222', '223',
       '224', '225', '226', '227', '228', '229', '230', '231', '232',
       '233', '234',

## Content in Amazon review data

**Question 3**

The dataset `reviews.txt` contains [Amazon reviews](http://jmcauley.ucsd.edu/data/amazon/) for ~200k phones and phone accessories. This dataset has been "cleaned" for you. The goal of this section is to create a function that takes in the review dataset and a review and returns the word that "best summarizes the review" using TF-IDF.

1. Create a function `tfidf_data(review, reviews)` that takes a review as well as the review data and returns a dataframe:
    - indexed by the words in `review`,
    - with columns given by (a) the number of times each word is found in the review (`cnt`), (b) the term frequency for each word (`tf`), (c) the inverse document frequency for each word (`idf`), and (d) the TF-IDF for each word (`tfidf`).
    
2. Create a function `relevant_word(tfidf_data)` which takes in a dataframe as above and returns the word that "best summarizes the review" described by `tfidf_data`.


*Note:* Use this function to "cluster" review types: run it on a sample of reviews and see which words come up most. Unfortunately, you will likely have to change your code from your answer above to run it on the entire dataset. To do this, compute as many of the frequencies as possible "ahead of time" and look them up when needed; you should also likely filter out words that occur "rarely".
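
A minimal sketch of the computation on a made-up three-review corpus, assuming the usual definitions: the term frequency of a word is its count divided by the number of words in the review, and the inverse document frequency is the log of (number of reviews / number of reviews containing the word). Document frequencies are counted once up front, in the spirit of the note above. The toy data and the whitespace tokenization are assumptions for illustration only.

```python
import math
from collections import Counter

# Made-up toy corpus and review (not from reviews.txt).
reviews = ["great case for the phone", "the phone is great", "the case broke"]
review = "great case great phone"

# Document frequency: in how many reviews each word appears, computed once.
doc_freq = Counter()
for doc in reviews:
    doc_freq.update(set(doc.split()))

counts = Counter(review.split())
total = sum(counts.values())

# TF-IDF = (count / total words in review) * log(#reviews / #reviews containing word)
tfidf = {
    word: (cnt / total) * math.log(len(reviews) / doc_freq[word])
    for word, cnt in counts.items()
}

best = max(tfidf, key=tfidf.get)
print(best)  # 'great'
```

Every word of the toy review appears in the corpus, so the division is safe here; on real data, a word that appears in no review would need separate handling.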

In [22]:
fp = os.path.join('data', 'reviews.txt')
reviews = pd.read_csv(fp, header=None, squeeze=True)
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()

In [23]:
def tfidf_data(review, reviews):
    """
    :Example:
    >>> fp = os.path.join('data', 'reviews.txt')
    >>> reviews = pd.read_csv(fp, header=None, squeeze=True)
    >>> review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
    >>> out = tfidf_data(review, reviews)
    >>> out['cnt'].sum()
    85
    >>> 'before' in out.index
    True
    """
    df = pd.DataFrame(pd.DataFrame({'words':re.findall('\S+',review)}).groupby('words').size()).rename(columns = {0:'cnt'})
    df['tf'] = df['cnt'] / df['cnt'].sum()
    idfs = []
    for word in df.index:
        # Escape the word so regex metacharacters in it are matched literally.
        pat = r'\b%s\b' % re.escape(word)
        idf = np.log((len(reviews)/reviews.str.contains(pat).sum()))
        idfs.append(idf)
    df['idf'] = idfs
    df['tfidf'] = df['tf'] * df['idf']
    return df


def relevant_word(out):
    """
    :Example:
    >>> fp = os.path.join('data', 'reviews.txt')
    >>> reviews = pd.read_csv(fp, header=None, squeeze=True)
    >>> review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
    >>> out = tfidf_data(review, reviews)
    >>> relevant_word(out) in out.index
    True
    """
    return out['tfidf'].idxmax()

In [624]:
out =tfidf_data(review, reviews)
relevant_word(out)

'chunk'

In [632]:
for i in out.index:
    print(i)

5
a
against
also
and
before
case
chunk
color
combinations
comes
cover
covers
create
damage
design
different
from
great
guard
hard
has
have
i
in
innovative
interchangeable
iphone
is
it
kind
locks
multiple
new
not
of
outside
perfectly
phone
plastic
polycarbonate
protect
really
seen
shell
silicone
skin
skins
slim
spills
such
suits
than
that
the
this
to
usual
with
your


### Tweet Analysis: Internet Research Agency

The dataset `data/ira.csv` contains tweets tagged by Twitter as likely being posted by the *Internet Research Agency* (the tweet factory facing allegations of attempting to influence US political elections).

The questions in this section will focus on the following:
1. We will look at the hashtags present in the text and trends in their makeup.
2. We will prepare this dataset for modeling by creating features out of the text fields.

**Question 4 (HashTags)**

You may assume that a hashtag is any string without whitespace following a `#` (this is more permissive than Twitter's rules for hashtags; you are encouraged to go down this rabbit-hole to better figure out how to clean your data!).

* Create a function `hashtag_list` that takes in a column of tweet-text and returns a column containing the list of hashtags present in the tweet text. If a tweet doesn't contain a hashtag, the function should return an empty list.

* Create a function `most_common_hashtag` that takes in a column of hashtag-lists (the output above) and returns a column consisting of a single hashtag from the tweet-text.
    - If the text has no hashtags, the entry should be `NaN`,
    - If the text has one distinct hashtag, the entry should contain that hashtag,
    - If the text has more than one hashtag, the entry should be the most common hashtag (among all hashtags in the column). If there is a tie for most common, any of the most common can be returned.
        - E.g. if the input column was: `pd.Series([[1, 2, 2], [3, 2, 3]])`, the output would be: `pd.Series([2, 2])`. Even though `3` was more common in the second list, `2` is the most common among all hashtags in the column.
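
The example above can be reproduced with a short sketch. It assumes one possible approach (flatten the lists with `Series.explode`, count with `value_counts`, then look up each list's entries); this is an illustration rather than the required implementation.

```python
import pandas as pd

lists = pd.Series([[1, 2, 2], [3, 2, 3]])

# Column-wide counts: explode() puts one list element per row.
counts = lists.explode().value_counts()

# For each list, pick the entry with the highest column-wide count.
result = lists.apply(
    lambda tags: counts.loc[tags].idxmax() if len(tags) else float('nan')
)

print(result.tolist())  # [2, 2]
```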

In [14]:
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])
ira

| | id | name | date | text |
|---|---|---|---|---|
| 0 | 3906258 | ea85ac8be1e8ab479064ca4c0fe3ac6587f76b1ef97452... | 2016-11-16 09:04 | The Best Exercise To Lose Belly Fat In 2 weeks... |
| 1 | 1051443 | 8e58ab0f46d273103d9e71aa92cdaffb6e330ec7d15ae5... | 2016-12-24 04:31 | RT @Philanthropy: Dozens of ‘hate groups’ have... |
| 2 | 2823399 | Room Of Rumor | 2016-08-18 20:26 | Artificial intelligence can find, map poverty,... |
| 3 | 272878 | San Francisco Daily | 2016-03-18 19:28 | Uber balks at rules proposed by world’s busies... |
| 4 | 7697802 | 41bb9ae5991f53996752a0ab8dd36b543821abca8d5aed... | 2016-07-30 15:44 | RT @dirtroaddiva1: #IHatePokemonGoBecause he ... |
| ... | ... | ... | ... | ... |
| 89995 | 5635647 | KansasCityDailyNews | 2016-04-03 21:19 | Trump: Kasich shouldn't be allowed to run http... |
| 89996 | 7012979 | f46e654ff3f1f9697f2b94de5a2e42a6914e1f00da14a7... | 2016-12-19 15:04 | RT @JefLeeson: #ThingsYouCantIgnore The last s... |
| 89997 | 6955774 | 88669ad69e40d7c199af91e8107f1e0e7988d377d2e41f... | 2016-08-27 14:15 | RT @nealcarter: When someone said the first li... |
| 89998 | 8563509 | ec2109adb67d2a24091026d5d9aab64dadca1fdb2f7355... | 2016-10-20 15:19 | RT @indigenous01: #rantfortoday I speak the Wo... |


In [45]:
def hashtag_list(tweet_text):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = pd.DataFrame(testdata, columns=['text'])
    >>> out = hashtag_list(test['text'])
    >>> (out.iloc[0] == ['NLP', 'NLP1', 'NLP1'])
    True
    """
    pattern = r'#\S+'

    def get_hashtags(string):
        # Find every '#...' run and drop the '#' characters.
        return [tag.replace('#', '') for tag in re.findall(pattern, string)]

    return tweet_text.apply(get_hashtags)


def most_common_hashtag(tweet_lists):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = hashtag_list(pd.DataFrame(testdata, columns=['text'])['text'])
    >>> most_common_hashtag(test).iloc[0]
    'NLP1'
    """
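    # Count each hashtag across the whole column; explode() flattens the lists
    # and value_counts() ignores the NaN rows produced by empty lists.
    count_df = tweet_lists.explode().value_counts()

    def helper(hashtags):
        if len(hashtags) == 0:
            return np.nan
        elif len(hashtags) == 1:
            return hashtags[0]
        else:
            # Among this tweet's hashtags, pick the one most common column-wide.
            return count_df.loc[hashtags].idxmax()
    return tweet_lists.apply(helper)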
    

In [46]:
#testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = pd.DataFrame(ira, columns=['text'])
out = hashtag_list(test['text'].iloc[:20])
new =most_common_hashtag(out)



In [47]:
new


0                  Exercise
1                       NaN
2                      tech
3                      news
4     IHatePokemonGoBecause
5                    health
6                       NaN
7      IWouldPreferToForget
8               NowPlaying:
9          AthleticsTVShows
10     HillaryRottenClinton
11                      NaN
12              TrumpTapes.
13            entertainment
14                      NaN
15                      NaN
16                      NaN
17                      NaN
18                   sports
19                 politics
Name: text, dtype: object

In [48]:
lis = pd.DataFrame([['a','b'],['b','c']])
#pd.DataFrame({'hash':lis}).groupby('hash').count()
cnt = lambda x :len(x)
ttl = lis.apply(cnt).sum()
new = []
def hel(lis):
    new.extend(lis)
lis.applymap(hel)
new

['a', 'b', 'a', 'b', 'b', 'c']

**Question 5 (Features)**

Now create a dataframe of features from the `ira` data. That is, create a function `create_features` that takes in the `ira` data and returns a dataframe with the same index as `ira` (i.e. the rows correspond to the same tweets) and the following columns:
* `num_hashtags` gives the number of hashtags present in a tweet,
* `mc_hashtags` gives the most common hashtag associated to a tweet (as given by the problem above),
* `num_tags` gives the number of tags a given tweet has (look for the presence of `@`),
* `num_links` gives the number of hyper-links present in a given tweet 
    - (a hyper-link is a string starting with `http(s)://` and continuing up to the next whitespace),
* A boolean column `is_retweet` that describes if the given tweet is a retweet (i.e. `RT`),
* A 'clean' text field `text` that contains the tweet text with:
    - The non-alphanumeric characters removed (except spaces),
    - All words should be separated by exactly one space,
    - The characters all lowercase,
    - All the meta-information above (Retweet info, tags, hyperlinks, hashtags) removed.

*Note:* You should make a helper function for each column.

*Note:* This will take a while to run on the entire dataset -- test it on a small sample first!
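
As a hedged sketch of the per-column helper idea, the counting columns can also be computed with pandas' vectorized string methods instead of `apply`. The two-tweet `tweets` Series is made up for illustration, and treating a retweet as any tweet containing the standalone token `RT` is an assumption about the intended rule.

```python
import pandas as pd

# Made-up sample: the doctest tweet plus a tweet with no meta-information.
tweets = pd.Series([
    'RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1',
    'no meta here',
])

num_hashtags = tweets.str.count(r'#\S+')
num_tags = tweets.str.count(r'@\S+')
num_links = tweets.str.count(r'https?://\S+')
# Assumption: a retweet contains 'RT' as a standalone token.
is_retweet = tweets.str.contains(r'\bRT\b')

print(num_hashtags.tolist(), num_tags.tolist(), num_links.tolist(), is_retweet.tolist())
```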

In [91]:
def create_features(ira):
    """
    :Example:
    >>> testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
    >>> test = pd.DataFrame(testdata, columns=['text'])
    >>> out = create_features(test)
    >>> anscols = ['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']
    >>> ansdata = [['text cleaning is cool', 3, 'NLP1', 1, 1, True]]
    >>> ans = pd.DataFrame(ansdata, columns=anscols)
    >>> (out == ans).all().all()
    True
    """
    tweet_text = ira['text']
    hash_list = hashtag_list(tweet_text.copy())
    num_hashtags = hash_list.apply(lambda x: len(x))
    mc_hashtags = most_common_hashtag(hash_list)
    def count_tags(string):
        # A tag is '@' followed by non-whitespace characters.
        return len(re.findall(r'@\S+', string))
    num_tags = tweet_text.apply(count_tags)

    def count_links(string):
        # 's?' makes the 's' optional without adding a capture group.
        return len(re.findall(r'https?://\S+', string))
    num_links = tweet_text.apply(count_links)

    def retweet(string):
        # Match 'RT' only as a standalone token, so words that merely
        # contain 'RT' are not flagged.
        return re.search(r'\bRT\b', string) is not None
    is_retweet = tweet_text.apply(retweet)

    def clean(string):
        # Remove retweet markers, links, tags, and hashtags, then keep
        # lowercase alphanumeric words separated by single spaces.
        temp = re.sub(r'\bRT\b|https?://\S+|@\S+|#\S+', '', string)
        return ' '.join(re.findall(r'[a-zA-Z0-9]+', temp.lower()))
    text = tweet_text.apply(clean)

    df = pd.DataFrame({'text': text, 'num_hashtags': num_hashtags, 'mc_hashtags': mc_hashtags,
                       'num_tags': num_tags, 'num_links': num_links, 'is_retweet': is_retweet})
    df.index = ira.index
    return df

In [96]:
testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
test = pd.DataFrame(testdata, columns=['text'])
out = create_features(ira.head(20))
anscols = ['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']
ansdata = [['text cleaning is cool', 3, 'NLP1', 1, 1, True]]
ans = pd.DataFrame(ansdata, columns=anscols)
out



| | text | num_hashtags | mc_hashtags | num_tags | num_links | is_retweet |
|---|---|---|---|---|---|---|
| 0 | the best exercise to lose belly fat in 2 weeks | 4 | Exercise | 0 | 2 | False |
| 1 | dozens of hate groups have charity status chro... | 0 | NaN | 1 | 1 | True |
| 2 | artificial intelligence can find map poverty r... | 1 | tech | 0 | 0 | False |
| 3 | uber balks at rules proposed by world s busies... | 1 | news | 0 | 0 | False |
| 4 | he didn t let me do that for a klondike bar sc... | 2 | IHatePokemonGoBecause | 1 | 1 | True |
| 5 | chick fil a remains closed after health violat... | 1 | health | 0 | 0 | False |
| 6 | we cannot afford to wait to address this publi... | 0 | NaN | 1 | 1 | True |
| 7 | that the two leading republican candidates are... | 1 | IWouldPreferToForget | 1 | 0 | True |
| 8 | rj ommio from nothing prod by davo | 4 | NowPlaying: | 1 | 1 | True |
| 9 | hill street vida blues | 1 | AthleticsTVShows | 1 | 0 | False |


In [97]:
out['text']

0        the best exercise to lose belly fat in 2 weeks
1     dozens of hate groups have charity status chro...
2     artificial intelligence can find map poverty r...
3     uber balks at rules proposed by world s busies...
4     he didn t let me do that for a klondike bar sc...
5     chick fil a remains closed after health violat...
6     we cannot afford to wait to address this publi...
7     that the two leading republican candidates are...
8                    rj ommio from nothing prod by davo
9                                hill street vida blues
10                 all you wanted to know about hillary
11    photos man driving atv hit by semi truck while...
12    you don t have to use your daughters and wives...
13    celebrity biographer wendy leigh dies in londo...
14    48 citizens escape horrible acts of terrorists...
15    its remaining divisions will be assigned to a ...
16                                video 2fic the mantle
17    hillary s campaign manager deleted all his

## Congratulations! You're done!

* Submit the lab on Gradescope