## Term Frequency

Before moving on to the sentiment analysis, I want to find out which words are used most often in the data set and how many times each one appears in the entire corpus. A simple word count is a reasonable first piece of exploratory analysis.

The sklearn.feature_extraction.text.CountVectorizer class converts a collection of text documents to a matrix of token counts. The implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix. The vectorizer accepts several parameters, for example to remove stop words or to limit the maximum number of terms.
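As a quick illustration of what those two parameters do, here is a minimal sketch on an invented three-document corpus (the sentences in `docs` are made up for this example and are not taken from the tweet data). It only assumes that scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "real madrid win the final",
    "liverpool win the final",
    "madrid madrid madrid",
]

# Drop English stop words ("the") and keep only the 3 most frequent terms.
cvec = CountVectorizer(stop_words='english', max_features=3)
counts = cvec.fit_transform(docs)  # sparse matrix: one row per document, one column per term

# vocabulary_ maps each kept term to its column index in the matrix.
vocab = cvec.vocabulary_

# Sum each column to get the corpus-wide count per term.
col_totals = np.asarray(counts.sum(axis=0)).ravel()
totals = {term: int(col_totals[idx]) for term, idx in vocab.items()}
print(totals)
```

With `max_features=3`, the less frequent terms ("real", "liverpool") are dropped from the vocabulary, so the matrix has only three columns.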

In [4]:
import os
os.chdir('C:\\Users\\yaffy\\Desktop\\python')

import pandas as pd
import numpy as np

my_df_madrid = pd.read_csv('clean_madrid.csv')
my_df_madrid.head()

Unnamed: 0.1,Unnamed: 0,name,screen_name,id,friends,followers,listed,clean_text,period
0,0,Binyamin Alsagoff,benalsagoff676,325975600.0,211,221,3,countdown final kyiv russia vs,1
1,1,LaLiga,LaLiga,423384500.0,828,3632180,5536,llega los goles con el,1
2,2,Vida Social Bogotá,SocialEnBogota,2611749000.0,14598,34695,290,llega los goles con el,1
3,3,TD.,iamthetd,1066035000.0,557,94,6,well ended seasons campaign way started it shi...,1
4,4,bigtime or bigmike,Michaelgreatho2,2440701000.0,1918,742,4,yall late,1


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(stop_words='english',max_features=10000)
cvec.fit(my_df_madrid.clean_text)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=10000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [6]:
pre_doc_matrix = cvec.transform(my_df_madrid[my_df_madrid.period == 1].clean_text)
mid_doc_matrix = cvec.transform(my_df_madrid[my_df_madrid.period == 2].clean_text)
post_doc_matrix = cvec.transform(my_df_madrid[my_df_madrid.period == 3].clean_text)
post2_doc_matrix = cvec.transform(my_df_madrid[my_df_madrid.period == 4].clean_text)
post3_doc_matrix = cvec.transform(my_df_madrid[my_df_madrid.period == 5].clean_text)

pre_tf = np.sum(pre_doc_matrix,axis=0)
mid_tf = np.sum(mid_doc_matrix,axis=0)
post_tf = np.sum(post_doc_matrix,axis=0)
post2_tf = np.sum(post2_doc_matrix,axis=0)
post3_tf = np.sum(post3_doc_matrix,axis=0)

pre = np.squeeze(np.asarray(pre_tf))
mid = np.squeeze(np.asarray(mid_tf))
post = np.squeeze(np.asarray(post_tf))
post2 = np.squeeze(np.asarray(post2_tf))
post3 = np.squeeze(np.asarray(post3_tf))

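# Each array is one period's counts, so the frame is built with periods as rows
# and then transposed to get one row per term.
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
term_freq_df = pd.DataFrame([pre,mid,post,post2,post3],columns=cvec.get_feature_names()).transpose()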
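The cell above repeats the same three steps (transform, sum, squeeze) for each of the five periods. The sketch below shows one way the same table could be built with a loop. It runs on a small invented DataFrame (`toy`, with made-up text) rather than on `my_df_madrid`, and its columns are labelled by the period value instead of 0 to 4, so treat it as an alternative under those assumptions, not as the notebook's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for my_df_madrid, invented for illustration.
toy = pd.DataFrame({
    'clean_text': ['madrid win final', 'liverpool final',
                   'madrid madrid win', 'zidane madrid'],
    'period': [1, 1, 2, 3],
})

cvec = CountVectorizer()
cvec.fit(toy.clean_text)

# Terms ordered by their column index in the document-term matrix.
terms = sorted(cvec.vocabulary_, key=cvec.vocabulary_.get)

# One vector of per-term counts for every period.
per_period = {}
for p in sorted(toy.period.unique()):
    doc_matrix = cvec.transform(toy.loc[toy.period == p, 'clean_text'])
    per_period[int(p)] = np.asarray(doc_matrix.sum(axis=0)).ravel()

tf = pd.DataFrame(per_period, index=terms)  # rows: terms, columns: periods
tf['total'] = tf.sum(axis=1)
print(tf.sort_values(by='total', ascending=False))
```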

In [7]:
term_freq_df['total'] = term_freq_df[0] + term_freq_df[1] + term_freq_df[2] + term_freq_df[3] + term_freq_df[4]
term_freq_df.sort_values(by='total', ascending=False).iloc[:10]

term,0,1,2,3,4,total
madrid,134,871,534,39,237,1815
real,101,662,420,42,189,1414
liverpool,176,787,185,17,5,1170
win,154,656,252,13,42,1117
champions,107,589,309,20,28,1053
team,88,402,321,21,102,934
zidane,33,99,688,26,50,896
final,245,416,152,29,14,856
congratulations,4,664,154,3,12,837
league,103,423,244,25,26,821


In [8]:
term_freq_df['pre_freq_pct'] = term_freq_df[0] * 1./term_freq_df[0].sum()
term_freq_df.sort_values(by='pre_freq_pct', ascending=False).iloc[:10]

term,0,1,2,3,4,total,pre_freq_pct
final,245,416,152,29,14,856,0.021791
liverpool,176,787,185,17,5,1170,0.015654
win,154,656,252,13,42,1117,0.013697
tomorrow,134,20,21,2,7,184,0.011919
madrid,134,871,534,39,237,1815,0.011919
champions,107,589,309,20,28,1053,0.009517
league,103,423,244,25,26,821,0.009161
real,101,662,420,42,189,1414,0.008983
saturday,93,10,20,1,1,125,0.008272
good,89,226,210,17,94,636,0.007916


In [9]:
term_freq_df['mid_freq_pct'] = term_freq_df[1] * 1./term_freq_df[1].sum()
term_freq_df['post_freq_pct'] = term_freq_df[2] * 1./term_freq_df[2].sum()
term_freq_df['post2_freq_pct'] = term_freq_df[3] * 1./term_freq_df[3].sum()
term_freq_df['post3_freq_pct'] = term_freq_df[4] * 1./term_freq_df[4].sum()

term_freq_df.sort_values(by='total', ascending=False).iloc[:10]

term,0,1,2,3,4,total,pre_freq_pct,mid_freq_pct,post_freq_pct,post2_freq_pct,post3_freq_pct
madrid,134,871,534,39,237,1815,0.011919,0.019114,0.010511,0.007237,0.015104
real,101,662,420,42,189,1414,0.008983,0.014527,0.008267,0.007794,0.012045
liverpool,176,787,185,17,5,1170,0.015654,0.017271,0.003642,0.003155,0.000319
win,154,656,252,13,42,1117,0.013697,0.014396,0.00496,0.002412,0.002677
champions,107,589,309,20,28,1053,0.009517,0.012925,0.006082,0.003711,0.001784
team,88,402,321,21,102,934,0.007827,0.008822,0.006319,0.003897,0.006501
zidane,33,99,688,26,50,896,0.002935,0.002173,0.013543,0.004825,0.003187
final,245,416,152,29,14,856,0.021791,0.009129,0.002992,0.005381,0.000892
congratulations,4,664,154,3,12,837,0.000356,0.014571,0.003031,0.000557,0.000765
league,103,423,244,25,26,821,0.009161,0.009283,0.004803,0.004639,0.001657
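Each `*_freq_pct` column above is a term's count in one period divided by the total count of all terms in that period, which makes periods with very different tweet volumes comparable. A minimal sketch of the same normalization applied to all period columns at once is shown below; the `counts` frame holds invented numbers, not values from the tables above.

```python
import pandas as pd

# Hypothetical counts for three terms over two periods (invented values).
counts = pd.DataFrame(
    {0: [3, 1, 0], 1: [1, 1, 2]},
    index=['final', 'madrid', 'zidane'],
)

# Divide every column by its own sum so that each column adds up to 1.
pct = counts.div(counts.sum(axis=0), axis=1)
print(pct)
```

Because each column of `pct` sums to 1, a value can be read as the share of that period's word occurrences taken up by the term.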


## Word Cloud

In [10]:
pre_tweets = my_df_madrid[my_df_madrid.period == 1]
# Join all pre-match tweets into one space-separated string for the word cloud.
pre_string = pre_tweets.clean_text.str.cat(sep=' ')

In [17]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud(width=1600, height=800,max_font_size=200).generate(pre_string)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

ImportError: DLL load failed: The specified module could not be found.

In [14]:
for t in pre_tweets.clean_text[:200]:
    if 'win' in t:
        print(t)

who wins
they re always been that nasty fuck win cost mentality they
on saturday see take final who victorious win europe greatest prize
who will win the champions league liverpool real madrid
salah score winner saturday win ko sunday anyone
love alot really hope wins think fans agree
would wonderful teams beginning win day john
who win
no sitting fence xabi who want win
he happy bro anything happen football real win year cl they
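Note that `'win' in t` is a substring test, so it also matches tweets such as "who wins" where "win" never appears as a separate token. Since CountVectorizer counts whole tokens, a word-boundary match is closer to what the term frequency table measures. The sketch below contrasts the two. The first three strings are copied from the output above, while "winter is here" is an invented example added to show a false positive.

```python
import re

tweets = [
    "who wins",
    "salah score winner saturday win ko sunday anyone",
    "who win",
    "winter is here",  # hypothetical tweet, not from the data set
]

# Substring test: matches "wins", "winner" and "winter" as well.
substring_hits = [t for t in tweets if 'win' in t]

# Word-boundary test: matches only the standalone token "win".
whole_word = re.compile(r'\bwin\b')
token_hits = [t for t in tweets if whole_word.search(t)]

print(len(substring_hits), len(token_hits))
```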
