# Text Processing

The file `UCSD tweets.csv` has a small number of tweets from August and September 2018 that contained the term "UCSD".  Let's analyze them!

### Steps

Each step is explained in more detail below.
1. Open the CSV and explore the data
2. Clean the data
3. Count word frequency
4. Sentiment analysis

In [2]:
# Imports
%matplotlib inline

import numpy as np
from datascience import *

# 1. Open the CSV and explore the data

### Steps
* Load the data from `data/UCSD tweets.csv`
* Examine the data.  How many tweets are there?  How long is the shortest tweet?  Longest?
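
One possible way to answer these questions is sketched below on a tiny stand-in for the file. The `sample_lines` list is made up for illustration (it reuses two tweets shown in this notebook); with the real data you would use the lines read from `data/UCSD tweets.csv`. The sketch also assumes the username and date fields never contain commas.

```python
# Hypothetical stand-in for the lines read from the CSV file.
sample_lines = [
    "username,date,text",  # header row, not a tweet
    "@fox5sandiego,Aug 27,UCSD ranked 7th best school in US by Washington Monthly",
    "@KamaFaye,Sep 1,it physically pains me that UCSD doesn't have a football team",
]

rows = sample_lines[1:]  # skip the header
num_tweets = len(rows)

# Split on the first two commas only, so commas inside the tweet text survive.
texts = [row.split(",", 2)[2] for row in rows]

shortest = min(texts, key=len)
longest = max(texts, key=len)

print(num_tweets, len(shortest), len(longest))
```

Measuring only the text field, rather than the whole line, keeps the username and date from inflating the tweet length.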

In [4]:
# your code here
with open("data/UCSD tweets.csv") as f:  # the file is closed automatically
    tweets = [line.strip() for line in f]  # one string per line, header included

In [5]:
# your code here
tweets

['username,date,text',
 '@fox5sandiego,Aug 27,UCSD ranked 7th best school in US by Washington Monthly',
 '@KamaFaye,Sep 1,it physically pains me that UCSD doesn’t have a football team',
 '@team10news_CA,Sep 2,A card mis-judged a parking space and over-ran a parking lot in the process taking out about about 40 to 50 feet of chain-link fence this morning at the Nobel apartments in the UCSD area.  Around 9:10 AM I... https://www.facebook.com/JAMIESCOTTmobile/posts/2177799288905402 …',
 '@SapnaKmd,Sep 2,Almost 50% increase in #PedsICU engagement from last week & >1000 tweets!👌🏽Special shoutout to @DrKanaris for the new Friday #PedsICU quiz tradition & @UCSD_PICU for the great #meded! @healthhashtags #medtwitter',
 '@kabeerthirty,Aug 27T,here’s a tinge of classism in saying SDSU should make do with its 283 acres while UCSD accommodates roughly the same number of students on its idyllic 2141-acre campus',
 '@KNBKSoshihan,Sep 2,This past weekend we observed the 25th anniversary of the UCSD Ia

# 2. Clean the data

### Filter out tweets that don't have `UCSD` in the text

The Twitter search matches on both the username and the tweet text.  We want only the tweets that have a match in the text itself.  The result should be a new table containing just the matching tweets.

* Create functions to apply to your table and clean your data
* Use `where` clauses to filter
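
As a rough illustration of the filtering idea, here is a plain-Python sketch that uses only the standard library. The sample rows, the `@UCSD_fan` username, and the helper name `mentions_ucsd` are invented for this example; they are not part of the data set or of any library.

```python
import csv
import io

# Hypothetical three-row sample in the same username,date,text layout.
sample_csv = io.StringIO(
    "username,date,text\n"
    "@fox5sandiego,Aug 27,UCSD ranked 7th best school in US by Washington Monthly\n"
    "@UCSD_fan,Sep 3,Beach day in La Jolla\n"
    "@KamaFaye,Sep 1,it physically pains me that UCSD has no football team\n"
)

def mentions_ucsd(text):
    """Return True when the tweet text itself contains 'UCSD' (case-sensitive)."""
    return "UCSD" in text

rows = list(csv.DictReader(sample_csv))
matching = [row for row in rows if mentions_ucsd(row["text"])]

print(len(rows), len(matching))
```

The second row matches the search only through its username, so it is dropped. With a `datascience` table, a filter along the lines of `tweets.where('text', are.containing('UCSD'))` should express the same idea; check the library documentation before relying on it.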

In [4]:
# your code here

### Check for duplicates

See if any of the tweets have exactly the same text.  If so, are they true duplicates?  Does it make sense to remove them?
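
One way to spot exact duplicates is to count how often each text occurs. The sketch below runs on a made-up list in which one tweet is deliberately repeated; it is an illustration, not the notebook's intended solution.

```python
from collections import Counter

# Hypothetical tweet texts; the second and third are identical on purpose.
texts = [
    "UCSD ranked 7th best school in US by Washington Monthly",
    "I chose UCSD because it has the best fb meme group",
    "I chose UCSD because it has the best fb meme group",
]

text_counts = Counter(texts)

# Keep only the texts that occur more than once.
duplicates = {text: n for text, n in text_counts.items() if n > 1}
print(duplicates)
```

Identical text alone does not prove a true duplicate: two different users may have posted the same words, so it is worth comparing the username and date as well before removing anything.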

In [None]:
# your code here

# 2. Tokenize

We want to find out what the most frequent words are, so we need to split the text into individual words.  In text processing, this is called tokenizing.

### Steps

1. Make a single long string with all of the tweet text.  Make sure to put spaces between them!
2. Split the tweets into a list of words using `.split()`
3. Print out the first 20 words to make sure it looks like what you think it should.

How many words are there all together?  How many distinct words? (remember `set()`)
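
The steps above can be sketched as follows, using two tweets from this notebook as a stand-in for the full list (the variable names are illustrative):

```python
texts = [
    "UCSD ranked 7th best school in US by Washington Monthly",
    "I chose UCSD because it has the best fb meme group",
]

all_text = " ".join(texts)  # the space keeps "Monthly" and "I" from merging
words = all_text.split()    # split on any run of whitespace

print(words[:20])
print(len(words), len(set(words)))  # total words, distinct words
```

Here `UCSD` and `best` each appear twice, which is why the distinct count is smaller than the total.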

In [None]:
# your code here

### Remove short words

Short words are really common, and aren't very helpful for comparing word counts.  Usually it is best to remove what are called "stop words", which include things like "of", "a", "in", etc.  In this case we will just remove all words that are fewer than three characters long.

The result should be a new list of words.  How many total?  How many unique?
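
A minimal sketch of the length filter, applied to a single tweet from this notebook for brevity:

```python
words = "I chose UCSD because it has the best fb meme group".split()

# Keep only words that are at least three characters long.
long_words = [word for word in words if len(word) >= 3]

print(long_words)
print(len(long_words), len(set(long_words)))  # total, unique
```

Note that three-letter stop words such as "the" and "has" survive this filter, which is one reason a proper stop-word list usually works better than a length cutoff.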

In [None]:
# Your code here

# 3. Count word frequency

You can use a dictionary to create a categorical distribution of the words in a sentence:

In [3]:
my_sentence = 'Jack be nimble, Jack be quick, Jack jump over the candlestick'
my_words = my_sentence.split()

categorical_distribution = {} # empty dictionary
for word in my_words:
    if word in categorical_distribution:
        categorical_distribution[word] = categorical_distribution[word] + 1
    else:
        categorical_distribution[word] = 1
        
categorical_distribution

{'Jack': 3,
 'be': 2,
 'nimble,': 1,
 'quick,': 1,
 'jump': 1,
 'over': 1,
 'the': 1,
 'candlestick': 1}

Create a categorical distribution of words for all tweets.  
* Are you surprised by the most common words?
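
As an alternative to the hand-written loop, Python's standard library offers `collections.Counter`, which builds the same kind of word count and can list the most frequent entries. A small sketch on the example sentence from above:

```python
from collections import Counter

my_sentence = 'Jack be nimble, Jack be quick, Jack jump over the candlestick'

# Counter does the counting that the dictionary loop does by hand.
word_counts = Counter(my_sentence.split())

print(word_counts.most_common(2))  # the two most frequent words with their counts
```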

In [4]:
# your code here

# 2.b. Tokenize again (with NLTK this time)

Why is `UCSD` counted only 18 times?  

Because `.split()` treats `@UCSD` and similar tokens (for example `#UCSD` or `UCSD.`) as different words from `UCSD`.  

Tokenizing (like most things) is harder than it looks at first!  
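
To see the problem concretely, here is a small sketch on an invented sentence (not taken from the data set). Splitting on whitespace leaves punctuation and symbols attached to the words, so none of the tokens is exactly `UCSD`:

```python
text = "Go @UCSD! I love UCSD. #UCSD rocks"  # made-up example sentence
tokens = text.split()
print(tokens)

exact_matches = tokens.count("UCSD")               # tokens equal to "UCSD"
containing = sum("UCSD" in tok for tok in tokens)  # tokens that contain "UCSD"

print(exact_matches, containing)
```

All three mentions are missed by an exact-match count, even though a reader would count each of them as `UCSD`.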

Generally, a good solution is to use a tool built for the job rather than rolling your own.  In this case, we will use the Python package Natural Language ToolKit, NLTK.  

You may need to install NLTK and also download its `punkt` tokenizer models.  If so, run this in the terminal:

```
pip install --user nltk
```

Then in Jupyter run this once:
```
import nltk
nltk.download('punkt')
```

Run the code below to use NLTK's tokenizer, and then repeat the process of removing short words and counting.

In [1]:
from nltk import tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/afraenkel/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
allText = ... # pass in a string consisting of all tweets

wordList = tokenize.word_tokenize(allText)
len(wordList)

# 3.b. Count (again)

In [None]:
# Remove short words

In [None]:
# Count

# Sentiment with NLTK

What is sentiment?  Why do we care?

You will need to run this once:
```
nltk.download('vader_lexicon')
```

In [9]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/afraenkel/nltk_data...


True

In [7]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer



In [10]:
sid = SentimentIntensityAnalyzer()
sid.polarity_scores("Good test!")

{'neg': 0.0, 'neu': 0.239, 'pos': 0.761, 'compound': 0.4926}

In [11]:
tweets = Table.read_table('data/UCSD tweets.csv')
tweetList = tweets.column('text') # a list of strings, with each string a tweet

In [None]:
tweetSentiments = []

for tweet in tweetList:
    tweetSentiment = sid.polarity_scores(tweet)
    tweetSentiment['text'] = tweet
    tweetSentiments.append(tweetSentiment)
    
tweetSentiments # a list of dictionaries, with text of tweets and sentiments


In [18]:
Table.from_records(tweetSentiments)

compound,neg,neu,pos,text
0.6369,0.0,0.682,0.318,UCSD ranked 7th best school in US by Washington Monthly
-0.4215,0.237,0.763,0.0,it physically pains me that UCSD doesn’t have a football ...
0.0,0.0,1.0,0.0,A card mis-judged a parking space and over-ran a parking ...
0.8659,0.0,0.72,0.28,Almost 50% increase in #PedsICU engagement from last wee ...
0.0772,0.0,0.954,0.046,here’s a tinge of classism in saying SDSU should make do ...
0.4588,0.0,0.885,0.115,This past weekend we observed the 25th anniversary of th ...
-0.6801,0.424,0.443,0.133,I HATE UCSD SO MUCH THEYRE SO ANNOYING OHHHHH MY GOD
-0.5319,0.231,0.657,0.112,LMFAOOO UCSD JUST CALLED ME ASKING IF I COULD MAKE A $25 ...
0.6369,0.0,0.682,0.318,I chose UCSD because it has the best fb meme group
0.34,0.0,0.789,0.211,High key excited to be going back to UCSD 💙💛
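
The `compound` score summarizes each tweet in a single number between -1 and 1. A common convention is to call a text positive when the score is at least 0.05 and negative when it is at most -0.05; the sketch below applies that rule to the first three scores in the table above. The cutoff value and the helper name `label_sentiment` are assumptions made for this example, not something defined in this notebook.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score to a coarse label (the threshold is an assumed convention)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores of the first three tweets in the table above.
compound_scores = [0.6369, -0.4215, 0.0]
labels = [label_sentiment(score) for score in compound_scores]

print(labels)
```

Coarse labels like these make it easy to count how many tweets fall into each group, at the cost of hiding how strongly positive or negative each one is.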


In [None]:
Table.from_records(tweetSentiments).sort('compound')  # most negative tweets first

# Next Steps

* load the file of "internet research agency" tweets (a small sample) and explore!
    - `data/ira.csv`