### Learning Python the Hard Way - Session 2
Toronto Data Literacy Group

Creator: Cindy Zhong

Date: January 09, 2017

#### Reading The Data

The data for this session can be downloaded from the GitHub repository.
If you want to get the data from Twitter yourself, it was created using the code from Session 1: https://github.com/cindyzhong/trt_data_lit_grp_python/tree/master/Lesson1

Next, read the comma-delimited file into Python. To do this, we can use the pandas package, which provides the read_csv function for reading delimited data files. If you haven't used pandas before, you may need to install it.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # show the full tweet text; None replaces the deprecated -1
tweet_df = pd.read_csv("tweet_sample.csv", delimiter=",", encoding = "utf-8")

In [None]:
# A look at the dimension of the dataframe
tweet_df.shape

In [None]:
# A look at the columns of the dataframe
tweet_df.columns.values

In [None]:
# A look at sample data
tweet_df[0:5]

In [None]:
# We can also look at a random sample of the rows
tweet_df.sample(5)

In [None]:
# Let's use one tweet as an example
tweet_eg = tweet_df['tweet_body'][8]
tweet_eg

#### Cleaning and Pre-Processing The Texts
We are interested in the text of the tweets.
Text analytics has no single standard way of pre-processing the data. Depending on the problem you are trying to solve, the pre-processing can be different.
In most cases, it consists of the following components:
- Removing Unwanted Characters
- Removing Punctuation
- Removing Numbers
- Standardizing Cases
- Removing Stopwords
We will explain each of them in our session.
We will make extensive use of NLTK (Natural Language Toolkit) and Python's built-in re (regular expression) module in this exercise.

#### Basic Text Cleaning Techniques

In [None]:
# Regular expressions are a very useful skill to learn in their own right.
import re

In [None]:
# Many of the tweets contain reference URLs; we want to remove them first
def remove_url(text):
	text = re.sub('http://[^ ]*', '', text)
	text = re.sub('https://[^ ]*', '', text)
	return text

In [None]:
# Using the function on our sample tweet
tweet_eg = remove_url(tweet_eg)
tweet_eg
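
The two substitutions in `remove_url` can also be folded into one pattern. The sketch below is an alternative, not the version used in this session; the name `remove_url_single_pattern` and the sample text are made up for illustration.

```python
import re

# "s?" makes the "s" optional, so one pattern covers http:// and https://.
# "\S+" consumes everything up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")

def remove_url_single_pattern(text):
    """Remove http:// and https:// links from text (illustrative helper)."""
    return URL_PATTERN.sub("", text)

sample = "Big rally tonight! https://t.co/abc123 details at http://example.com/x"
cleaned = remove_url_single_pattern(sample)
print(cleaned)
```

The leftover double spaces are expected at this stage; they are collapsed later by the blank-removal step.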

In [None]:
# Removing the @user mentions
def remove_at_user(text):
    # r'...' is a raw string, so \S (any non-whitespace character) is passed to re unchanged
    return re.sub(r'@\S+', '', text)

In [None]:
tweet_eg = remove_at_user(tweet_eg)
tweet_eg

In [None]:
# Now try to write a function to remove the retweet marker 'RT'
def remove_rt(text):
    # \b marks a word boundary, so 'RT' inside a longer word is left alone
    text = re.sub(r'\bRT\b', '', text, count=1)
    return text

In [None]:
tweet_eg = remove_rt(tweet_eg)
tweet_eg
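
When removing 'RT', note that a bare `'RT'` pattern also matches inside longer uppercase words. The sketch below compares a plain pattern with a word-bounded one; both helper names and the sample tweet are invented for illustration.

```python
import re

def remove_rt_plain(text):
    # Matches the first "RT" anywhere, even inside another word
    return re.sub("RT", "", text, count=1)

def remove_rt_bounded(text):
    # \b requires a word boundary on both sides of "RT"
    return re.sub(r"\bRT\b", "", text, count=1)

sample = "SMART move RT : agreed"
print(remove_rt_plain(sample))    # SMA move RT : agreed
print(remove_rt_bounded(sample))  # SMART move  : agreed
```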

In [None]:
# Let's remove punctuation and numbers, basically all non-letters for now
def remove_non_letters(text):
    return re.sub("[^a-zA-Z]", " ", text)

In [None]:
tweet_eg = remove_non_letters(tweet_eg)
tweet_eg
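
One side effect of the `[^a-zA-Z]` pattern is worth knowing: it also drops accented letters and splits contractions. The sketch below shows this on a made-up sentence and compares it with a Unicode-aware variant. That variant is a hypothetical alternative, not what this session uses.

```python
import re

def remove_non_letters_ascii(text):
    # Same pattern as the session's remove_non_letters
    return re.sub("[^a-zA-Z]", " ", text)

def remove_non_letters_unicode(text):
    # Hypothetical variant: replace non-word characters, digits and "_",
    # so accented letters such as "é" are kept
    return re.sub(r"[\W\d_]", " ", text)

sample = "Don't miss café #1!"
print(remove_non_letters_ascii(sample).split())    # ['Don', 't', 'miss', 'caf']
print(remove_non_letters_unicode(sample).split())  # ['Don', 't', 'miss', 'café']
```

Which behaviour is preferable depends on the language of the tweets and on the analysis that follows.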

In [None]:
# We might want to remove some extra blanks
def remove_extra_blanks(text):
	text = re.sub('\n', ' ', text)
	text = re.sub(" +"," ",text).strip() #remove extra spaces
	return text

In [None]:
tweet_eg = remove_extra_blanks(tweet_eg)
tweet_eg

In [None]:
# Standardizing Cases
def all_lower_case(text):
	return text.lower()

tweet_eg = all_lower_case(tweet_eg)
tweet_eg

In [None]:
# Now, let's put all of the above cleaning functions together
def my_text_cleanser(text):
    # Non-string values (e.g. NaN for a missing tweet) become an empty string
    if not isinstance(text, str):
        return ''
    text = remove_url(text)
    text = remove_at_user(text)
    text = remove_rt(text)
    text = remove_non_letters(text)
    text = remove_extra_blanks(text)
    text = all_lower_case(text)
    return text

In [None]:
# We will apply the text cleanser to our 'tweet_body' column, using a very commonly used function in pandas 'apply'
tweet_df['tweet_body_clean'] = tweet_df.tweet_body.apply(my_text_cleanser)

In [None]:
# Take a look at the old column and the cleaned new column
tweet_df[['tweet_body','tweet_body_clean']].sample(5)

#### Removing Stopwords
Stopwords are words that occur very frequently but carry little meaning on their own, for example 'am', 'and', 'the'.
We often want to remove these words when doing text analytics.
To do this, we will use NLTK.
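
Before bringing in NLTK, the idea can be sketched with a small hand-written list. The stopword set and the function name below are made up for illustration; NLTK's English list is considerably longer.

```python
# A tiny hand-picked stopword set, for illustration only
TOY_STOPWORDS = {"i", "am", "and", "the", "a", "to", "of", "is"}

def remove_stopwords_toy(text, stopword_set=TOY_STOPWORDS):
    """Split text on whitespace and drop every token found in stopword_set."""
    return [w for w in text.split() if w not in stopword_set]

print(remove_stopwords_toy("i am going to the rally and the debate"))
# ['going', 'rally', 'debate']
```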

In [None]:
# If you haven't done so already, download NLTK's stopword corpus
# (and WordNet, which the lemmatizer used later in this session needs)
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
# Import the stop word list
from nltk.corpus import stopwords 
print (stopwords.words("english")) 

In [None]:
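def remove_stopwords(text):
    # Build the stopword set once per call instead of reloading the list for every word
    stop_set = set(stopwords.words("english"))
    words = text.split()
    meaningful_words = [w for w in words if w not in stop_set]
    return meaningful_words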

In [None]:
tweet_eg = remove_stopwords(tweet_eg)
tweet_eg

#### Word Stemming
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, which is generally a written word form.
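
To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is far simpler than the Snowball algorithm, and the suffix list and `min_stem_len` threshold are arbitrary choices for illustration.

```python
# Suffixes are checked in order, so longer ones come first
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["interaction", "interactions", "interact", "interacting"]:
    print(w, "->", naive_stem(w))
```

All four words reduce to `interact`. Rules this crude also over-stem, turning `news` into `new` for example, which is one reason real stemmers use more careful rule sets.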

In [None]:
# Examples of stemmed words
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
print (snowball_stemmer.stem('interaction'))
print (snowball_stemmer.stem('interact'))
print (snowball_stemmer.stem('interactions'))
print (snowball_stemmer.stem('interactivity'))

#### Word Lemmatization
Lemmatisation (or lemmatization), in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.
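
Unlike a stemmer, a lemmatizer maps each word to a dictionary form that is itself a real word. A minimal sketch of that idea is a lookup table. The table below is a hypothetical stand-in for a real lexical database.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemma
TOY_LEMMAS = {
    "mice": "mouse",
    "geese": "goose",
    "children": "child",
    "interactions": "interaction",
}

def toy_lemmatize(word, lemmas=TOY_LEMMAS):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return lemmas.get(word, word)

for w in ["mice", "interactions", "interact"]:
    print(w, "->", toy_lemmatize(w))
```

Irregular forms such as `mice` are handled correctly because nothing is chopped off; the cost is that coverage depends entirely on the dictionary.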

In [None]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print (wordnet_lemmatizer.lemmatize('interaction'))
print (wordnet_lemmatizer.lemmatize('interact'))
print (wordnet_lemmatizer.lemmatize('interactions'))
print (wordnet_lemmatizer.lemmatize('interactivity'))

In [None]:
# We will be using the lemmatizer for our purpose
def lemmatizer(words):
    return [wordnet_lemmatizer.lemmatize(w) for w in words]

In [None]:
def my_text_tokenizer(text):
    words = remove_stopwords(text)
    words = lemmatizer(words)
    return words

In [None]:
# Now let's apply the functions above to our cleaned tweet
tweet_df['tweet_body_terms'] = tweet_df.tweet_body_clean.apply(my_text_tokenizer)

In [None]:
# Take a look at what we've done so far
tweet_df[['tweet_body','tweet_body_clean','tweet_body_terms']].sample(5)

### Simple Text Analytics on Tweets
With the text pre-processed, we can now do some simple but interesting analytics on the tweets. In this session, we will look at the following for Trump and Hillary:
- Term Collocation
- Lexical Diversity

In [None]:
# Since we will be creating statistics at user level, we group the dataframe by users
users_df = tweet_df.groupby('handle').agg({'tweet_body_terms':sum,'tweet_body_clean':lambda x: ' '.join(x)})
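
The aggregation above is assumed to concatenate each user's term lists and to join each user's cleaned tweets into one string. In plain Python, the same grouping could be sketched as follows; the handles and rows are invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (handle, tweet_body_terms, tweet_body_clean)
rows = [
    ("userA", ["great", "rally"], "great rally"),
    ("userB", ["stronger", "together"], "stronger together"),
    ("userA", ["fake", "news"], "fake news"),
]

terms_by_handle = defaultdict(list)
texts_by_handle = defaultdict(list)
for handle, terms, clean in rows:
    terms_by_handle[handle].extend(terms)   # concatenate term lists per user
    texts_by_handle[handle].append(clean)   # collect cleaned tweets per user

# Join each user's cleaned tweets into one long string
clean_by_handle = {h: " ".join(texts) for h, texts in texts_by_handle.items()}

print(terms_by_handle["userA"])  # ['great', 'rally', 'fake', 'news']
print(clean_by_handle["userA"])  # great rally fake news
```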

#### Term Collocations
Collocations are partly or fully fixed expressions that become established through repeated context-dependent use. 
For example, 'crystal clear', 'middle management', and 'plastic surgery' are examples of collocated pairs of words.
We are interested in looking at term collocations because the context gives us better insight into the meaning of a term, supporting applications such as word disambiguation or semantic similarity.

In [None]:
# Find the top collocations in the tweets
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def top_collocation_text(words):
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    # Ignore bigrams that occur fewer than 5 times
    finder.apply_freq_filter(5)
    # Return the 20 bigrams with the highest pointwise mutual information (PMI)
    return finder.nbest(bigram_measures.pmi, 20)
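
The `pmi` score used to rank bigrams is pointwise mutual information: how much more often two words appear together than would be expected if they were independent. The sketch below computes it from raw counts. It is a from-scratch illustration, not NLTK's exact implementation; the function name and word list are made up.

```python
import math
from collections import Counter

def bigram_pmi(words, min_freq=1):
    """Return {(w1, w2): PMI} for adjacent pairs seen at least min_freq times."""
    unigram_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    total_words = len(words)
    total_bigrams = max(len(words) - 1, 1)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_freq:
            continue  # plays the role of a frequency filter
        p_pair = count / total_bigrams
        p_w1 = unigram_counts[w1] / total_words
        p_w2 = unigram_counts[w2] / total_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

words = ["make", "america", "great", "make", "america", "safe"]
print(bigram_pmi(words, min_freq=2))
```

PMI tends to favour rare pairs, since a pair seen only once can still score highly, which is why a minimum-frequency filter is usually applied first.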

In [None]:
# Let's see the top collocations for Hillary and Trump
users_df['top_collocation_text'] = users_df.tweet_body_terms.apply(top_collocation_text)

In [None]:
print (users_df['top_collocation_text'])

#### Lexical Diversity
Lexical diversity is a measure of how many different words are used in a text.
The more varied the vocabulary of a text, the higher its lexical diversity.
For a text to be highly lexically diverse, the speaker or writer has to use many different words, with little repetition of the words already used.
Here, the lexical diversity of a text is defined as the ratio of the number of unique terms to the total number of terms.

In [None]:
def lexical_diversity(words):
    # Unique terms divided by total terms; the +1 avoids division by zero for an empty list
    return 1.0*len(set(words))/(len(words)+1)
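
A small worked example may help in reading the scores. The function is repeated so the block runs on its own, and the two word lists are invented for illustration.

```python
def lexical_diversity(words):
    # Same formula as in the session: unique terms / (total terms + 1)
    return 1.0 * len(set(words)) / (len(words) + 1)

repetitive = "great great great wall great deal".split()      # 6 terms, 3 unique
varied = "stronger together build bridges not walls".split()  # 6 terms, 6 unique

print(lexical_diversity(repetitive))  # 3 / 7
print(lexical_diversity(varied))      # 6 / 7
print(lexical_diversity([]))          # 0.0
```

Because of the `+1` in the denominator, the score stays slightly below 1 even when every term is unique.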

In [None]:
users_df['lexical_diversity'] = users_df.tweet_body_terms.apply(lexical_diversity)

In [None]:
print (users_df['lexical_diversity'])