# Analysis of Tweets regarding Charlottesville Rally

"*The Unite the Right rally was a white supremacist rally that took place in Charlottesville, Virginia, from August 11 to 12, 2017. Far-right groups participated, including self-identified members of the alt-right, neo-Confederates, neo-fascists, white nationalists, neo-Nazis, and various right-wing militias*"
(from Wikipedia: https://en.wikipedia.org/wiki/Unite_the_Right_rally).

The dataset `aug15_sample.csv` contains tweets shared regarding this event.

Run the following cell (after un-commenting its lines) if your installation of Python does not have Seaborn.

In [None]:
#import sys
#!{sys.executable} -m pip install seaborn

In [None]:
import nltk
import string
import pandas as pd
import re
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter

tweets = pd.read_csv('aug15_sample.csv')

Use the pandas `.head()` method to display the first rows of the dataset.

In [None]:
tweets.head()

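Let's remove stopwords: very common words (such as "the", "is" and "and") that carry little meaning on their own. The same step also strips punctuation, slashes and similar symbols, along with the leftover token `https` from links.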
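
As a minimal sketch of this kind of cleaning, the toy example below removes stopwords and punctuation with regular expressions and then counts the remaining words. The names `toy_stopwords` and `toy_tweets` are made up for illustration; the notebook itself uses NLTK's full English stopword list and the `full_text` column.

```python
import re
from collections import Counter

# Hypothetical mini stopword list; the notebook uses NLTK's English list.
toy_stopwords = ["the", "a", "is", "in", "of", "and"]

# One regex that matches any stopword as a whole word (\b is a word boundary).
stopword_pattern = r"\b(?:{})\b".format("|".join(toy_stopwords))

# Hypothetical tweets, invented for this example.
toy_tweets = [
    "The rally is in the news!",
    "News of the rally, and a protest.",
]

toy_words = []
for text in toy_tweets:
    text = text.lower()
    text = re.sub(stopword_pattern, " ", text)  # drop stopwords
    text = re.sub(r"[^\w\s]", " ", text)        # drop punctuation
    toy_words.extend(text.split())

# Most frequent remaining words with their counts.
toy_counts = Counter(toy_words).most_common(3)
print(toy_counts)
```

Only "rally", "news" and "protest" survive the cleaning, so the counts reflect content words rather than grammatical filler.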

In [None]:
top_N = 30
# If the stopword corpus is missing, run nltk.download('stopwords') once
stopwords = nltk.corpus.stopwords.words('english')
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
words = (tweets['full_text']
           .str.lower()
           .replace([r'\|', RE_stopwords, r"(&amp)|,|;|\"|\.|\?|’|!|'|:|-|\\|/|https"], [' ', ' ', ' '], regex=True)
           .str.cat(sep=' ')
           .split()
)

rslt = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')

# Drop the first row, i.e. the single most frequent token
rslt = rslt.iloc[1:]

What is `rslt`? Print it and display its type.

In [None]:
print(rslt)

In [None]:
print(type(rslt))

The cell below plots the word frequencies as a horizontal bar chart, using the Python module `seaborn`.

In [None]:
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = [30.0, 20.0]
ax = sns.barplot(y=rslt.index, x='Frequency', data=rslt)
ax.set_xlabel("Frequency",fontsize=25)
ax.set_ylabel("Words",fontsize=25)
ax.tick_params(labelsize=30)

Research the Python module `seaborn`, and try different visualisations

In [None]:
# Add your code here

Now, let's identify the most frequent hashtags.

In [None]:
# Tweets without hashtags (missing values) are skipped by str.cat
tags = (tweets['hashtags']
           .str.lower()
           .str.cat(sep=' ')
           .split()
)

hashtgs = pd.DataFrame(Counter(tags).most_common(top_N),
                    columns=['Hashtags', 'Frequency']).set_index('Hashtags')
hashtgs = hashtgs.iloc[1:]


Investigate the variable type of `hashtgs` and print its content

In [None]:
print(type(hashtgs))
print(hashtgs)

Modify the above code to visualise the `hashtgs` values using Seaborn

In [None]:
sns.set_style("whitegrid")
ax = sns.barplot(y=hashtgs.index, x='Frequency', data=hashtgs)
ax.set_xlabel("Frequency",fontsize=25)
ax.set_ylabel("Hashtag",fontsize=25)
ax.tick_params(labelsize=30)

In [None]:
tweets['created_at'] = pd.to_datetime(tweets['created_at'])
tweets = tweets.set_index('created_at')

In [None]:
tweets_timestamp = tweets[['id']]
tweet_volume = tweets_timestamp.resample('10min').count()

Answer the following questions:

* What does `resample('10min')` do?
* What does `count()` do?
* Describe the difference between `tweets_timestamp` and `tweet_volume` by visualising them.
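
To explore these questions on something small, the sketch below applies the same two calls to four made-up timestamps. The frame `toy` and its values are hypothetical; only `resample` and `count` come from the notebook.

```python
import pandas as pd

# Hypothetical data: three tweets in the first 10 minutes, one 20 minutes later.
toy = pd.DataFrame(
    {"id": [101, 102, 103, 104]},
    index=pd.to_datetime([
        "2017-08-15 12:01", "2017-08-15 12:04",
        "2017-08-15 12:09", "2017-08-15 12:25",
    ]),
)

# resample('10min') groups the rows into consecutive 10-minute bins;
# count() then counts the non-null values in each bin.
toy_volume = toy.resample("10min").count()
print(toy_volume)
```

Note that the empty bin starting at 12:10 still appears in the result with a count of zero, which is what makes the output usable as a time series.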

In [None]:
print(tweets_timestamp)

In [None]:
print(tweet_volume)

In [None]:
ax = sns.pointplot(x=tweet_volume.index, y='id', data=tweet_volume)
ax.set_xlabel("Timestamp of tweets",fontsize=30)
ax.set_ylabel("Number of tweets",fontsize=30)

ax.tick_params(labelsize=25)

for item in ax.get_xticklabels():
    item.set_rotation(90)

Let's now find the most influential users, measured by their follower counts.

In [None]:
influential = tweets[['user_name', 'followers_count']]


Print the first entries

In [None]:
influential.head()

Let's now sort the users in *descending* order of follower count, keeping one row per user, using the following commands

In [None]:
influential = influential.sort_values('followers_count', ascending=False)
influential.groupby('user_name').first().sort_values(by='followers_count', ascending=False)[:10]

These are the ten users with the most followers among those who tweeted during the last 3 hours.
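
The sort-then-group pattern used in the cell above can be tried on a toy table. In this sketch the frame `toy_users` and its values are invented; because the rows are sorted first, `groupby(...).first()` keeps each user's largest follower count.

```python
import pandas as pd

# Hypothetical users; "ann" appears twice with different follower counts.
toy_users = pd.DataFrame({
    "user_name": ["ann", "bob", "ann", "cat"],
    "followers_count": [120, 50, 125, 900],
})

top_users = (
    toy_users.sort_values("followers_count", ascending=False)  # largest first
             .groupby("user_name").first()                     # one row per user
             .sort_values(by="followers_count", ascending=False)
)
print(top_users)
```

The second `sort_values` is needed because `groupby` returns its groups ordered by user name, not by follower count.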

Now, print the 20 most active users and the number of tweets each of them has posted.

In [None]:
tweets['screen_name'].value_counts()[:20]

In [None]:
# clustering algorithms 
# from http://ahmedbesbes.com/how-to-mine-newsfeed-data-and-extract-interactive-insights-in-python.html

pd.options.mode.chained_assignment = None
# nltk for nlp
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
# list of stopwords like articles, preposition
stop = set(stopwords.words('english'))
from string import punctuation
from collections import Counter

def tokenizer(text):
    try:
        tokens_ = [word_tokenize(sent) for sent in sent_tokenize(text)]
        
        tokens = []
        for token_by_sent in tokens_:
            tokens += token_by_sent

        tokens = list(filter(lambda t: t.lower() not in stop, tokens))
        tokens = list(filter(lambda t: t not in punctuation, tokens))
        tokens = list(filter(lambda t: t not in [u"'s", u"n't", u"...", u"''", u'``', u'amp', u'https',
                                                u'via', u"'re"], tokens))
        filtered_tokens = []
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)

        filtered_tokens = list(map(lambda token: token.lower(), filtered_tokens))

        return filtered_tokens
    except Exception as e:
        # Report the problem and return no tokens so callers still get a list
        print(e)
        return []

In [None]:
tweets['tokens'] = tweets['full_text'].map(tokenizer)

In [None]:
for full_text, tokens in zip(tweets['full_text'].head(5), tweets['tokens'].head(5)):
    print('full text:', full_text)
    print('tokens:', tokens)
    print() 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df: ignore terms that appear in fewer than this many documents (tweets)
# max_features: keep at most this many terms, ranked by frequency across the corpus
# TfidfVectorizer tokenises each tweet using the tokenizer we defined above

vectorizer = TfidfVectorizer(min_df=10, max_features=10000, tokenizer=tokenizer, ngram_range=(1, 2))
vz = vectorizer.fit_transform(list(tweets['full_text']))

In [None]:
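# idf_ holds one inverse-document-frequency weight per term
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tfidf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
tfidf = pd.DataFrame.from_dict(tfidf, orient='index', columns=['tfidf'])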
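
The `idf_` attribute used above stores the inverse document frequency of each term, so the `tfidf` column holds IDF weights rather than full TF-IDF scores. The sketch below computes the same kind of weight by hand, assuming scikit-learn's default smoothed formula `ln((1 + n) / (1 + df)) + 1`; the documents in `toy_docs` are made up.

```python
import math

# Hypothetical tokenised documents.
toy_docs = [
    ["rally", "news"],
    ["rally", "protest"],
    ["rally", "news", "police"],
]

n_docs = len(toy_docs)

# Document frequency: in how many documents does each term occur?
doc_freq = {}
for doc in toy_docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Smoothed IDF (scikit-learn default, smooth_idf=True):
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
toy_idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Print terms from most common (lowest IDF) to rarest (highest IDF).
for term, value in sorted(toy_idf.items(), key=lambda kv: kv[1]):
    print(f"{term}: {value:.3f}")
```

A term that occurs in every document gets the minimum weight of 1, while rarer terms get larger weights, which is why sorting by this value in descending order surfaces the rarest terms.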

In [None]:
tfidf.tfidf.hist(bins=50, figsize=(15,7))

In [None]:
tfidf.sort_values(by=['tfidf'], ascending=False).head(30)

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

from sklearn.cluster import MiniBatchKMeans

num_clusters = 10
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters, init='k-means++', n_init=1, 
                         init_size=1000, batch_size=1000, verbose=False, max_iter=1000)
kmeans = kmeans_model.fit(vz)
kmeans_clusters = kmeans.predict(vz)
kmeans_distances = kmeans.transform(vz)

I used the k-means clustering algorithm to group tweets by their TF-IDF vectors; the cell below prints the highest-weighted terms of each cluster, that is, words that frequently appear together.
You can see threads of conversation that we couldn’t detect from the word frequency list. One example of this is Cluster #3, where a pocket of people expressed their displeasure with CNN coverage (cluster numbers can differ between runs, because no random seed is fixed). K-means clustering is a useful complement to our word frequency tally.
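
For intuition about what k-means does, here is a minimal sketch of the plain (Lloyd's) algorithm on six made-up 2-D points. It is not the mini-batch variant used in the notebook, and the data, the choice of two clusters and the fixed starting centroids are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical 2-D points forming two obvious groups.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])

# Use two of the points as initial centroids (fixed for reproducibility).
centroids = points[[0, 3]].copy()

for _ in range(10):
    # Assignment step: distance from every point to every centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points.
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # centroids stopped moving, so the algorithm has converged
    centroids = new_centroids

print(labels)
print(centroids)
```

In the notebook the points are TF-IDF vectors with thousands of dimensions, and the largest coordinates of each centroid identify the terms that characterise that cluster.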

In [None]:
sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for i in range(num_clusters):
    print("Cluster %d:" % i)
    aux = ''
    for j in sorted_centroids[i, :10]:
        aux += terms[j] + ' | '
    print(aux)
    print() 