In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# Lab 7 – Regular Expressions

## DSC 80, Spring 2024

### Due Date: Wednesday, May 22nd at 11:59 PM

## Instructions

Welcome to the seventh DSC 80 lab this quarter!

Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook, and **you will only submit that `lab.py` file**, not this notebook!

Some additional guidelines:
- **Unlike in DSC 10, labs will have both public tests and hidden tests.** The bulk of your grade will come from your scores on hidden tests, which you will only see on Gradescope after the assignment deadline.
- **Do not change the function names in the `lab.py` file!** The functions in the `lab.py` file are how your assignment is graded, and they are graded by their name. If you changed something you weren't supposed to, you can find the original code in the [course GitHub repository](https://github.com/dsc-courses/dsc80-2024-sp).
- Notebooks are nice for testing and experimenting with different implementations before designing your function in your `lab.py` file. You can write code here, but make sure that all of your real work is in the `lab.py` file, since that's all you're submitting.
- You are encouraged to write your own additional helper functions to solve the lab, as long as they also end up in `lab.py`.

**To ensure that all of the work you want to submit is in `lab.py`, we've included a script named `lab-validation.py` in the lab folder. You shouldn't edit it, but instead, you should call it from the command line (e.g. the Terminal) to test your work.** More details on its usage are given at the bottom of this notebook.

**Importing code from `lab.py`**:

* Below, we import the `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab.py` is compiled to bytecode (in the directory `__pycache__`). Subsequent imports of `lab` merely import the existing compiled python.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from lab import *

In [4]:
import pandas as pd
import numpy as np
import os
import re

## Question 1 – Practice with Regular Expressions 🛠
<div class="alert alert-block alert-warning">
<b>Note</b>: The functions in this question all have doctests in their docstrings. The doctests constitute the public tests; we will still run hidden tests on each of your functions. <b>Please don't change any of the docstrings!</b>
</div>

Regular expressions can be tricky, and the best way to gain familiarity with them is through lots of practice. In this question, you will work through ten exercises, each of which requires you to write a regular expression that matches strings that satisfy certain criteria. Make sure to take a close look at the doctests for each function in `lab.py`, as they provide useful guidance for the types of strings you should and shouldn't match.

***Notes:*** 
- Make sure to refer to the [Regular Expression Resources](https://dsc80.com/resources/#regular-expressions) posted on the course website. In particular, we recommend having [regex101.com](https://regex101.com/) open while working, along with the [cheat sheet](https://dsc80.com/resources/other/berkeley-regex-reference.pdf).

- Each exercise has a star rating, between 1 (⭐️) and 3 (⭐️⭐️⭐️) stars, indicating its difficulty level (1 being the easiest, 3 being the hardest). If you are spending lots of time on 1-star exercises, take a close look at the syntax from lecture, as there is probably an easier way of writing the necessary pattern!

<br>

### Exercise 1 (⭐️)

Write a regular expression that matches strings that have `'['` as the third character and `']'` as the sixth character.

<br>

### Exercise 2 (⭐️)

Write a regular expression that matches strings that are phone numbers that start with `'(858)'` and follow the format `'(xxx) xxx-xxxx'` (`'x'` represents a digit).

***Note:*** There is a space between `'(xxx)'` and `'xxx-xxxx'`.

<br>

### Exercise 3 (⭐️)

Write a regular expression that matches strings that:
- are between 6 and 10 characters long (inclusive),
- contain only alphanumeric characters, whitespace and `'?'`, and
- end with `'?'`.

<br>

### Exercise 4 (⭐️⭐️)

Write a regular expression that matches strings with exactly two `'$'`, one of which is at the start of the string, such that:
- the characters between the two `'$'` can be anything (including nothing) except the lowercase letters `'a'`, `'b'`, and `'c'` (and `'$'`), and
- the characters after the second `'$'` can only be the **lowercase or uppercase** letters `'a'`/`'A'`, `'b'`/`'B'`, and `'c'`/`'C'`, with every `'a'`/`'A'` before every `'b'`/`'B'`, and every `'b'`/`'B'` before every `'c'`/`'C'`. There must be at least one `'a'` or `'A'`, at least one `'b'` or `'B'`, and at least one `'c'` or `'C'`.
    

<br>

### Exercise 5 (⭐️)
Write a regular expression that matches strings that represent valid Python file names, including the extension. 

***Note:*** For simplicity, assume that file names only contain letters, numbers, and underscores (`'_'`).

<br>

### Exercise 6 (⭐️)
Write a regular expression that matches strings that:
- are made up of only lowercase letters and exactly one underscore (`'_'`), and
- have at least one lowercase letter on both sides of the underscore.

<br>

### Exercise 7 (⭐️)
Write a regular expression that matches strings that start with and end with an underscore (`'_'`).

<br>

### Exercise 8 (⭐️)

Apple serial numbers are strings of length 1 or more that are made up of any characters, other than
- the uppercase letter `'O'`, 
- the lowercase letter `'i'`, and 
- the number `'1'`.

Write a regular expression that matches strings that are valid Apple serial numbers.

<br>

### Exercise 9 (⭐️⭐️)

ID numbers are formatted as `'SC-NN-CCC-NNNN'`, where 
- SC represents state code in uppercase (e.g. `'CA'`),
- NN represents a number with 2 digits (e.g. `'98'`),
- CCC represents a three letter city code in uppercase (e.g. `'SAN'`), and
- NNNN represents a number with 4 digits (e.g. `'1024'`).

Write a regular expression that matches strings that are ID numbers corresponding to the cities of `'SAN'` or `'LAX'`, or the state of `'NY'`. Assume that there is only one city named `'SAN'` and only one city named `'LAX'`.

<br>

### Exercise 10 (⭐️⭐️⭐️)

Write a function named `match_10` that takes in a string and:
- converts the string to lowercase,
- removes all non-alphanumeric characters (i.e. removes everything that is not in the `\w` character class), and the letter `'a'`, and
- returns a list of every **non-overlapping** three-character substring in the remaining string, starting from the beginning of the string.
   
For instance, consider the following doctest:

```py
>>> match_10('Ab..DEF')
['bde']
```

Here's how `match_10` should process `'Ab..DEF'`:

1. Convert to lowercase: `'ab..def'`.
2. Remove non-alphanumeric characters and the letter `'a'`: `'bdef'`.
3. Starting from the beginning of the string, there is only a single non-overlapping three character substring: `'bde'`. Hence, we return `['bde']`.

***Note:*** Perform your operations in the exact order described above, otherwise your code may not pass all the tests. Don't use a `for`-loop.
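
The three steps above can be traced one at a time, as in the sketch below. The helper name `trace_steps` is hypothetical (it is not part of `lab.py`), and the sketch assumes that `re.findall` with a fixed-width pattern is an acceptable loop-free way to produce non-overlapping chunks. Note that `'_'` belongs to the `\w` character class, so it is kept.

```python
import re

def trace_steps(string):
    # Hypothetical helper for illustration only; not part of lab.py.
    lowered = string.lower()                  # step 1: lowercase
    kept = re.sub(r'[\Wa]', '', lowered)      # step 2: drop non-\w characters and 'a'
    # step 3: findall scans left to right and never reuses a character,
    # so the three-character matches cannot overlap
    chunks = re.findall(r'\w{3}', kept)
    return lowered, kept, chunks

print(trace_steps('Ab..DEF'))  # ('ab..def', 'bdef', ['bde'])
print(trace_steps('A_b-cd'))   # ('a_b-cd', '_bcd', ['_bc'])
```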

In [5]:
def match_1(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_1("abcde]")
    False
    >>> match_1("ab[cde")
    False
    >>> match_1("a[cd]")
    False
    >>> match_1("ab[cd]")
    True
    >>> match_1("1ab[cd]")
    False
    >>> match_1("ab[cd]ef")
    True
    >>> match_1("1b[#d] _")
    True
    """
    pattern = r'^..[\[]..[\]].*$'

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


def match_2(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_2("(123) 456-7890")
    False
    >>> match_2("858-456-7890")
    False
    >>> match_2("(858)45-7890")
    False
    >>> match_2("(858) 456-7890")
    True
    >>> match_2("(858)456-789")
    False
    >>> match_2("(858)456-7890")
    False
    >>> match_2("a(858) 456-7890")
    False
    >>> match_2("(858) 456-7890b")
    False
    """
    pattern = r'^\(858\) \d{3}-\d{4}$'

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


def match_3(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_3("qwertsd?")
    True
    >>> match_3("qw?ertsd?")
    True
    >>> match_3("ab c?")
    False
    >>> match_3("ab   c ?")
    True
    >>> match_3(" asdfqwes ?")
    False
    >>> match_3(" adfqwes ?")
    True
    >>> match_3(" adf!qes ?")
    False
    >>> match_3(" adf!qe? ")
    False
    """
    pattern = r'^[a-zA-Z0-9\s?]{5,9}\?$'

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


def match_4(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_4("$$AaaaaBbbbc")
    True
    >>> match_4("$!@#$aABc")
    True
    >>> match_4("$a$aABc")
    False
    >>> match_4("$iiuABc")
    False
    >>> match_4("123$$$Abc")
    False
    >>> match_4("$$Abc")
    True
    >>> match_4("$qw345t$AAAc")
    False
    >>> match_4("$s$Bca")
    False
    >>> match_4("$!@$")
    False
    """
    pattern = r'^\$[^abc$]*\$[Aa]+[Bb]+[Cc]+$'

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


def match_5(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_5("dsc80.py")
    True
    >>> match_5("dsc80py")
    False
    >>> match_5("dsc80..py")
    False
    >>> match_5("dsc80+.py")
    False
    """
    pattern = r'^[a-zA-Z0-9_]+\.py$'

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


def match_6(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_6("aab_cbb_bc")
    False
    >>> match_6("aab_cbbbc")
    True
    >>> match_6("aab_Abbbc")
    False
    >>> match_6("abcdef")
    False
    >>> match_6("ABCDEF_ABCD")
    False
    """
    pattern = r'^[a-z]+_[a-z]+$'

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


def match_7(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_7("_abc_")
    True
    >>> match_7("abd")
    False
    >>> match_7("bcd")
    False
    >>> match_7("_ncde")
    False
    """
    pattern = r'^_.*_$'

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None



def match_8(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_8("ASJDKLFK10ASDO")
    False
    >>> match_8("ASJDKLFK0ASDo!!!!!!! !!!!!!!!!")
    True
    >>> match_8("JKLSDNM01IDKSL")
    False
    >>> match_8("ASDKJLdsi0SKLl")
    False
    >>> match_8("ASDJKL9380JKAL")
    True
    """
    pattern = r'^[^Oi1]+$'

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None



def match_9(string):
    '''
    DO NOT EDIT THE DOCSTRING!
    >>> match_9('NY-32-NYC-1232')
    True
    >>> match_9('ca-23-SAN-1231')
    False
    >>> match_9('MA-36-BOS-5465')
    False
    >>> match_9('CA-56-LAX-7895')
    True
    >>> match_9('NY-32-LAX-0000') # If the state is NY, the city can be any 3 letter code, including LAX or SAN!
    True
    >>> match_9('TX-32-SAN-4491')
    False
    '''
    pattern = r'^(NY-\d{2}-[A-Z]{3}-\d{4}|CA-\d{2}-(SAN|LAX)-\d{4})$'

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None


def match_10(string):
    '''
    DO NOT EDIT THE DOCSTRING!
    >>> match_10('ABCdef')
    ['bcd']
    >>> match_10(' DEFaabc !g ')
    ['def', 'bcg']
    >>> match_10('Come ti chiami?')
    ['com', 'eti', 'chi']
    >>> match_10('and')
    []
    >>> match_10('Ab..DEF')
    ['bde']
    
    '''
    string = string.lower()
    # drop everything outside \w (so underscores are kept), plus the letter 'a'
    string = re.sub(r'[\Wa]', '', string)
    # each match consumes three characters, so the chunks never overlap
    return re.findall(r'.{3}', string)

In [6]:
grader.check("q1")

## Question 2 – Capturing Groups in Regular Expressions 📡

The dataset stored in `data/messy.txt` contains personal information from a fictional website that a user scraped from web server logs. Within this dataset, there are four fields that are of interest to you:
1. Email Addresses (assume they are alphanumeric usernames and domain names)
2. [Social Security Numbers](https://en.wikipedia.org/wiki/Social_Security_number#Structure)
3. Bitcoin Addresses (alphanumeric strings of long length)
4. Street Addresses

Complete the implementation of the function `extract_personal`, which takes in a string containing the contents of a server log file (like `open('data/messy.txt').read()`) and returns a **tuple of four separate lists** containing values of the 4 pieces of information listed above (in the order listed above). Do **not** keep empty values.

***Note:*** Since this data is messy, your function will be allowed to miss ~5% of the records in each list. Good spot checking using certain useful substrings (e.g. `'@'` for emails) should help assure correctness! Your function will be tested on a sample of the file `messy.txt`.

***Hint:*** There are multiple "delimiters" in use in the file; there are few enough of them that you can safely determine what they are.
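
The spot check suggested in the note can be sketched as follows: count how often a telltale substring occurs and compare that count with the number of values extracted. The sample text, the delimiters in it, and the email pattern are all made up for illustration; the real `messy.txt` may look different.

```python
import re

# Made-up sample; not taken from data/messy.txt.
sample = 'x|a1@site.com,ssn:123-45-6789\ty|b2@web.org|null@'

# Assumed pattern: alphanumeric username and domain name, as the question states.
emails = re.findall(r'[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z]+', sample)

# Every email contains '@', so the '@' count is an upper bound on the emails present.
at_count = sample.count('@')

print(emails)    # ['a1@site.com', 'b2@web.org']
print(at_count)  # 3
print(f'extracted {len(emails)} of at most {at_count} candidates')
```

A large gap between the two numbers suggests the pattern is missing records; a small gap may just reflect empty or malformed values.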

In [7]:
# experiment with extract_personal using the file s below
fp = os.path.join('data', 'messy.txt')
s = open(fp, encoding='utf8').read()

In [8]:
# s

In [9]:
def extract_personal(s):
    email_reg = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    ssn_reg = r'\b\d{3}-\d{2}-\d{4}\b'
    bitcoin_reg = r'\b1[a-km-zA-HJ-NP-Z0-9]{26,35}\b'
    street_reg = r'\d{2,5}\s\w+(?:\s\w+)*\s(?:Street|St|Avenue|Ave|Boulevard|Blvd|Court|Ct|Drive|Dr|Lane|Ln|Parkway|Pkwy|Road|Rd|Trail|Trl|Way|Plaza|Plz|Terrace|Ter|Place|Pl|Circle|Cir|Square|Sq|Loop|Lp)'

    email = re.findall(email_reg, s)
    ssn = re.findall(ssn_reg, s)
    bitcoin = re.findall(bitcoin_reg, s)
    street = re.findall(street_reg, s)

    return (email, ssn, bitcoin, street)

In [10]:
# don't change this cell, but do run it -- it is needed for the tests
test_fp = os.path.join('data', 'messy.test.txt')
test_s = open(test_fp, encoding='utf8').read()
emails, ssn, bitcoin, addresses = extract_personal(test_s)

In [11]:
grader.check("q2")

## Question 3 – TF-IDF 📊

The dataset `data/reviews.txt` contains [Amazon reviews](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews/) for ~200k phones and phone accessories. The dataset has already been "cleaned" for you. In this question, you will complete the implementation of a function, which takes in the reviews dataset as a Series (with one entry per review) as well as a single review, and returns the word that "best summarizes the single review" using TF-IDF.

To do so, implement the two functions below.

#### `tfidf_data`
You will first need to decompose the review into individual words, count their frequencies, and calculate the TF-IDF score for each word. Complete the implementation of the function `tfidf_data`, which takes in the reviews data as a Series (`reviews_ser`) and a single review (`review`) from that Series and returns a DataFrame indexed by the words in `review` with four columns:
- `'cnt'`: the number of times each word is found in the review (telling how frequently a word is used in the `review`)
- `'tf'`: the term frequency for each word (giving an idea of how important each word is relative to the other words in the `review`)
- `'idf'`: the inverse document frequency for each word (helping understand how important each word is to the whole `reviews_ser`)
- `'tfidf'`: the TF-IDF for each word (measuring how important each word is to `review`, taking into account both the frequency of the word in the `review` and across `reviews_ser`)

You may use a `for`-loop. The words in the outputted DataFrame may appear in any order.

You may also assume that the provided `review` will always be an element of the `reviews_ser` Series, so you do not need to account for division-by-zero errors.

***Hint:*** You may need to use the [`'\b'` character](https://www.regular-expressions.info/wordboundaries.html) somewhere.
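
As a worked illustration of the four columns, the sketch below computes them for a single word over a three-document toy corpus. It assumes the usual definitions: term frequency is the word's count divided by the number of words in the review, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The corpus and the helper name `tfidf_for_word` are invented for the example.

```python
import math
import re

def tfidf_for_word(word, review, corpus):
    # Hypothetical helper for illustration only; not part of lab.py.
    words = review.split()
    cnt = words.count(word)
    tf = cnt / len(words)
    # \b stops 'case' from matching inside 'cases' or 'showcase'
    pattern = r'\b' + re.escape(word) + r'\b'
    n_containing = sum(1 for doc in corpus if re.search(pattern, doc))
    idf = math.log(len(corpus) / n_containing)
    return cnt, tf, idf, tf * idf

# Toy corpus (assumed data); the review must be one of its documents.
corpus = ['great case', 'great phone great price', 'slim cases']

cnt, tf, idf, tfidf = tfidf_for_word('great', corpus[1], corpus)
print(cnt, tf)  # 2 0.5
print(idf)      # log(3 / 2), since 'great' appears in 2 of the 3 documents
```

Because the review is assumed to be part of the corpus, `n_containing` is at least 1 and the division is safe.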
    
<br>

#### `relevant_word`

Complete the implementation of the function `relevant_word`, which takes in the DataFrame that `tfidf_data` returns and returns the word that "best summarizes" the review. If there are multiple "best" summary words, return any one of them.



In [12]:
# experiment with tfidf_data using reviews_ser and review below 
fp = os.path.join('data', 'reviews.txt')
reviews_ser = pd.read_csv(fp, header=None).squeeze("columns")
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()

In [13]:
review
# reviews_ser

'this is a great new case design that i have not seen before it has a slim silicone skin that really locks in the phone to cover and protect your phone from spills and such and also a hard polycarbonate outside shell cover to guard it against damage  this case also comes with different interchangeable skins and covers to create multiple color combinations  this is a different kind of case than the usual chunk of plastic  it is innovative and suits the iphone 5 perfectly'

In [14]:
def tfidf_data(reviews_ser, review):
    review_words = review.replace('  ', ' ').split(' ')
    review_len = len(review_words)

    reviews_ser_len = len(reviews_ser)

    cnt = {}
    tf = {}
    idf = {}
    tfidf = {}

    for word in review_words:
        count = review_words.count(word)
        cnt[word] = count

        freq = count/review_len
        tf[word] = freq

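        # \b keeps `word` from matching inside longer words; escape it in case it
        # contains regex metacharacters, and avoid shadowing the `review` argument
        word_appears = sum(1 for doc in reviews_ser if re.search(r'\b' + re.escape(word) + r'\b', doc))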
        inv_doc_freq = np.log(reviews_ser_len / word_appears)
        idf[word] = inv_doc_freq

        tfidf[word] = freq * inv_doc_freq

    return pd.DataFrame({'cnt':cnt, 'tf':tf, 'idf':idf, 'tfidf':tfidf})


def relevant_word(tfidf_df):
    best_word = tfidf_df['tfidf'].idxmax()
    return best_word

In [15]:
# don't change this cell, but do run it -- it is needed for the tests
fp = os.path.join('data', 'reviews.txt')
reviews_ser = pd.read_csv(fp, header=None).squeeze("columns")
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
q3_tfidf = tfidf_data(reviews_ser, review)

try:
    q3_rel = relevant_word(q3_tfidf)
except:
    q3_rel = None

In [16]:
q3_tfidf

|  | cnt | tf | idf | tfidf |
| --- | --- | --- | --- | --- |
| this | 3 | 0.035294 | 0.441295 | 0.015575 |
| is | 3 | 0.035294 | 0.516393 | 0.018226 |
| a | 4 | 0.047059 | 0.360392 | 0.01696 |
| great | 1 | 0.011765 | 1.301647 | 0.015313 |
| new | 1 | 0.011765 | 2.610749 | 0.030715 |
| case | 3 | 0.035294 | 1.121935 | 0.039598 |
| design | 1 | 0.011765 | 3.055127 | 0.035943 |
| that | 2 | 0.023529 | 0.845801 | 0.019901 |
| i | 1 | 0.011765 | 0.254672 | 0.002996 |
| have | 1 | 0.011765 | 0.975412 | 0.011475 |


In [17]:
# print(q3_rel)

In [18]:
# print(q3_tfidf.index)

In [19]:
grader.check("q3")

## Questions 4 and 5 – Tweet Analysis 🐥

The dataset `data/ira.csv` contains tweets tagged by Twitter as likely being posted by the [Internet Research Agency](https://en.wikipedia.org/wiki/Internet_Research_Agency), a tweet factory accused of attempting to influence US political elections.

Questions 4 and 5 will focus on the following:
- Question 4: Look at the hashtags present in the text and trends in their makeup.
- Question 5: Prepare the dataset for modeling by creating features out of the text fields.

### Question 4 – Hashtags #️⃣

You may assume that a hashtag is any string without whitespace that immediately follows a `'#'` symbol. To clarify, a hashtag can contain a `'#'` symbol, as long as there is no whitespace. For example, `'#data#science'` is considered one individual hashtag, whereas `'#data #science'` is considered two individual hashtags.
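
The two examples above can be checked with a short sketch. The helper name `find_hashtags` is hypothetical, and the pattern assumes that a lone `'#'` with nothing after it does not count as a hashtag.

```python
import re

def find_hashtags(text):
    # Hypothetical helper for illustration only.
    # '#' followed by one or more non-whitespace characters;
    # the capturing group drops the leading '#'.
    return re.findall(r'#(\S+)', text)

print(find_hashtags('#data#science'))   # ['data#science']
print(find_hashtags('#data #science'))  # ['data', 'science']
```

Since `\S+` is greedy, it keeps consuming characters (including a second `'#'`) until it reaches whitespace, which is why the first string yields a single hashtag.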

#### `hashtag_list`

Complete the implementation of the function `hashtag_list`, which takes in a Series of tweet texts and returns a Series containing a list of hashtags present in each tweet's text. If a tweet's text doesn't contain a hashtag, the Series should contain an empty list for that tweet. Don't include the leading `'#'` symbol in the lists that are returned.

<br>

#### `most_common_hashtag`

Complete the implementation of the function `most_common_hashtag`, which takes in a Series of hashtag lists (as is outputted by `hashtag_list`) and returns a Series consisting of a single hashtag per tweet: 
- If the tweet's text has no hashtags, the entry in the output Series should be `NaN`.
- If the tweet's text has one distinct hashtag, the entry in the output Series should be that hashtag.
- If the tweet's text has more than one hashtag, the entry in the output Series should be the most common hashtag **in the tweet's text** with respect to **the whole input Series**. If there is a tie for the most common, any of the most common can be returned.
    - For example, if the input Series was `pd.Series([[2], [2], [3, 2, 3]])`, the output would be `pd.Series([2, 2, 2])`. Even though `3` was more common in the third list than `2`, `2` is more common than `3` among all hashtags in the Series.
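
The rule in the example can be sketched on plain Python lists, before worrying about Series. The helper name `pick_most_common` is hypothetical, and `None` stands in for `NaN` here purely to keep the sketch free of third-party imports.

```python
from collections import Counter

def pick_most_common(tag_lists):
    # Hypothetical helper for illustration only; works on lists, not a Series.
    # Count every hashtag across the whole input, not per tweet.
    counts = Counter(tag for tags in tag_lists for tag in tags)
    # For each tweet, pick whichever of its own tags has the largest global count.
    return [max(tags, key=counts.__getitem__) if tags else None
            for tags in tag_lists]

print(pick_most_common([[2], [2], [3, 2, 3]]))  # [2, 2, 2]
```

In the third list, `3` appears twice and `2` once, but across the whole input `2` appears three times and `3` only twice, so `2` wins.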

In [20]:
# The public tests don't test your work on the `ira` data,
# but the hidden tests do.
# So, make sure to thoroughly test your work yourself!
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])
# ira

In [47]:
def hashtag_list(tweet_text):
    total_hash = []
    for tweet in tweet_text:
        hashtag = re.findall(r'#[^\s]*', tweet)
        hashtag = list(map(lambda x: x[1:], hashtag))
        total_hash.append(hashtag)
    return pd.Series(total_hash)


def most_common_hashtag(tweet_lists):
    count = {}
    for tags in tweet_lists:
        for tag in tags:
            if tag not in count:
                count[tag] = 1
            else:
                count[tag] += 1

    hash_list = []
    for tags in tweet_lists:
        if len(tags) == 0:
            hash_list.append(np.nan)
        elif len(tags) == 1:
            hash_list.append(tags[0])
        else:
            # compare every tag in the tweet (not just adjacent pairs),
            # using counts taken over the whole input Series
            common = max(tags, key=lambda tag: count[tag])
            hash_list.append(common)

    return pd.Series(hash_list)

In [48]:
# testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
# test = pd.DataFrame(testdata, columns=['text'])['text']
# most_common_hashtag(test).iloc[0] == 'NLP1'
# out = hashtag_list(test)

In [49]:
grader.check("q4")

### Question 5 – Features 📋

Now, create a DataFrame of features from the `ira` data.  That is, create a function `create_features` that takes in a DataFrame `ira` that has just a single column, `'text'`, and returns a DataFrame with the same index as `ira` (i.e. the rows correspond to the same tweets) and the following columns:
* `'num_hashtags'`, the number of hashtags present in the tweet.
* `'mc_hashtags'`, the most common hashtag associated to the tweet (using the result of `most_common_hashtag` from Question 4).
* `'num_tags'`, the number of tags the tweet has.
    - A tag is a string starting with the `'@'` symbol followed by only alphanumeric characters. An `'@'` symbol on its own does not count as a tag.
* `'num_links'`, the number of hyperlinks present in the tweet.
    - A hyperlink is a string starting with `'http://'` or `'https://'` and continuing up to (but not including) the next whitespace character.
* `'is_retweet'`, a Boolean describing whether the tweet is a retweet. A retweet is a tweet that **begins** with `'RT'`.
* `'text'`, a version of the tweet's text that is cleaned according to the following steps, **in this exact order**:
    1. All meta-information above (retweet info, tags, hyperlinks, and hashtags) should be replaced with a single space.
    2. Everything other than letters, numbers, and spaces should be replaced with a single space.
    3. All letters should be lowercase.
    4. All words should be separated by exactly one space, and leading/trailing whitespace should be removed (stripped).
    
The columns in the outputted DataFrame must be in the order `['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']`. (Remember, the DataFrame that `create_features` is called on only has a single column, `'text'`.)

***Notes:***
- It's a good idea to make a helper function for each column.
- The `\w` character class in regex **does not** refer to letters, numbers, and spaces (or even just letters and numbers). As such, you can't use it here!
- `create_features` will take a while to run on the entire dataset – test it on a small sample first!
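
The four cleaning steps can be sketched on a single string as shown below. The helper name `clean_text` and the exact patterns are assumptions made for illustration (for instance, "retweet info" is taken to mean an `'RT'` at the very start of the tweet); they are one reasonable reading of the rules, not the required implementation.

```python
import re

def clean_text(text):
    # Hypothetical helper for illustration only.
    # Step 1: replace meta-information with a single space.
    text = re.sub(r'^RT', ' ', text)            # retweet marker at the start
    text = re.sub(r'@[A-Za-z0-9]+', ' ', text)  # tags (explicit class, not \w)
    text = re.sub(r'https?://\S*', ' ', text)   # hyperlinks
    text = re.sub(r'#\S+', ' ', text)           # hashtags
    # Step 2: anything other than letters, numbers, and spaces becomes a space.
    text = re.sub(r'[^A-Za-z0-9 ]', ' ', text)
    # Step 3: lowercase.
    text = text.lower()
    # Step 4: collapse runs of spaces and strip the ends.
    return re.sub(r' +', ' ', text).strip()

tweet = 'RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1'
print(clean_text(tweet))  # text cleaning is cool
```

The explicit class `[A-Za-z0-9]` is used instead of `\w` because `\w` also matches the underscore.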

In [71]:
# The doctests/public tests don't test your work on the `ira` data,
# but the hidden tests do.
# So, make sure to thoroughly test your work yourself!
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])

In [80]:
def tag_helper(tweet):
    total = []
    for text in tweet:
        # \w also matches '_', so spell out the alphanumeric class explicitly
        tag = re.findall(r'@[A-Za-z0-9]+', text)
        tag = list(map(lambda x: x[1:], tag))
        total.append(tag)
    return pd.Series(total).apply(lambda x: len(x))

def link_helper(tweet):
    total = []
    for text in tweet:
        # the 's' is optional; \S* runs until the next whitespace character
        link = re.findall(r'https?://\S*', text)
        total.append(link)
    return pd.Series(total).apply(lambda x: len(x))

def clean_helper(tweet):
    cleaned = []
    for text in tweet:
        text = re.sub(r'^RT', ' ', text)  # a retweet *begins* with 'RT'
        text = re.sub(r'@[A-Za-z0-9]+', ' ', text)
        text = re.sub(r'https?://\S*', ' ', text)
        text = re.sub(r'#[^\s]*', ' ', text)
        text = re.sub(r'[^A-Za-z\d\s]+', ' ', text)
        text = text.lower()
        text = re.sub(r'\s+', ' ', text).strip()
        cleaned.append(text)
    return pd.Series(cleaned)

In [93]:
def create_features(ira):
    texts = ira['text']
    hashtags = hashtag_list(texts)

    # The helpers return Series with a fresh RangeIndex, so align them
    # positionally via raw values and reuse the index of `ira`. This also
    # works when `ira` does not have a default 0..n-1 index.
    return pd.DataFrame({
        'text': clean_helper(texts).to_numpy(),
        'num_hashtags': hashtags.apply(len).to_numpy(),
        'mc_hashtags': most_common_hashtag(hashtags).to_numpy(),
        'num_tags': tag_helper(texts).to_numpy(),
        'num_links': link_helper(texts).to_numpy(),
        'is_retweet': texts.str.startswith('RT').to_numpy(),
    }, index=ira.index)

In [94]:
# don't change this cell, but do run it -- it is needed for the tests
# (yes, we know it says "hidden" – there are still truly hidden tests in this question)
fp_hidden = 'data/ira_test.csv'
ira_hidden = pd.read_csv(fp_hidden, header=None)
text_hidden = ira_hidden.iloc[:, -1:]
text_hidden.columns = ['text']
test_hidden = create_features(text_hidden)

In [95]:
grader.check("q5")

## Congratulations! You're done Lab 7! 🏁

As a reminder, all of the work you want to submit needs to be in `lab.py`.

To ensure that all of the work you want to submit is in `lab.py`, we've included a script named `lab-validation.py` in the lab folder. You shouldn't edit it, but instead, you should call it from the command line (e.g. the Terminal) to test your work.

Once you've finished the lab, you should open the command line and run, in the directory for this lab:

```
python lab-validation.py
```

**This will run all of the `grader.check` cells that you see in this notebook, but only using the code in `lab.py` – that is, it doesn't look at any of the code in this notebook. If all of your `grader.check` cells pass in this notebook but not all of them pass in your command line with the above command, then you likely have code in your notebook that isn't in your `lab.py`!**

You can also use `lab-validation.py` to test individual questions. For instance,

```
python lab-validation.py q1 q2 q4
```

will run the `grader.check` cells for Questions 1, 2, and 4 – again, only using the code in `lab.py`. [This video](https://www.loom.com/share/0ea254b85b2745e59322b5e5a8692e91?sid=5acc92e6-0dfe-4555-9b6a-8115b6a52f99) shows how to use the script as well.

Once `python lab-validation.py` shows that you're passing all test cases, you're ready to submit your `lab.py` (and only your `lab.py`) to Gradescope. After submitting to Gradescope, make sure to stick around until all test cases pass.

There is also a call to `grader.check_all()` below in _this_ notebook, but make sure to also follow the steps above.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [29]:
grader.check_all()

q1 results: All test cases passed!

q2 results: All test cases passed!

q3 results: All test cases passed!

q4 results: All test cases passed!

q5 results:
    q5 - 1 result:
        Trying:
            testdata = [['RT @DSC80: Text-cleaning is cool! #NLP https://t.co/xsfdw88d #NLP1 #NLP1']]
        Expecting nothing
        ok
        Trying:
            test = pd.DataFrame(testdata, columns=['text'])
        Expecting nothing
        ok
        Trying:
            out = create_features(test)
        Expecting nothing
        ok
        Trying:
            anscols = ['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']
        Expecting nothing
        ok
        Trying:
            ansdata = [['text cleaning is cool', 3, 'NLP1', 1, 1, True]]
        Expecting nothing
        ok
        Trying:
            ans = pd.DataFrame(ansdata, columns=anscols)
        Expecting nothing
        ok
        Trying:
            (out == ans).all().all()
        Expecting:
  