# Data Exploration

### Context

To mark International Women's Day on March 8th, 2019, tweets were collected from Twitter to build the dataset on which the sentiment analysis in this work is carried out.

This notebook explores the dataset to get information about the event.

We start by importing the libraries and configuring some settings:

In [None]:
import pandas as pd
import re
import string
import unicodedata
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
# Show up to 150 characters per column so that whole tweets stay readable
pd.set_option('display.max_colwidth', 150)

First, we load the preprocessed data from the input CSV file:

In [None]:
df = pd.read_csv("ejecuciones_codigo/3_preprocessed_tweets_8M.csv", sep="|", lineterminator='\n', low_memory=False)
print(df.shape)  # (rows, columns); printed because a cell only displays its last expression
df.head()

- According to the `created_date` field, this is the first tweet collected on the day of the protest:

In [None]:
first_tweet = df.sort_values(by=["created_date"],ascending=True)[["tweets", "created_date", "source", "RTs", "user_name", "user_location"]].head(1).iloc[0]
first_tweet

- This is the last one:

In [None]:
last_tweet = df.sort_values(by=["created_date"],ascending=True)[["tweets", "created_date", "source", "RTs", "user_name", "user_location"]].tail(1).iloc[0]
last_tweet

- This is the tweet with the most retweets:

In [None]:
df.sort_values(by=["RTs"],ascending=False)[["tweets", "created_date", "source", "RTs", "user_name", "user_location"]].head(1).iloc[0]

- This is the tweet with the most likes:

In [None]:
df.sort_values(by=["likes"],ascending=False)[["tweets", "created_date", "source", "RTs", "likes", "user_name", "user_location"]].head(1).iloc[0]

- These are the 10 most used sources for posting on Twitter:

In [None]:
topsources_tweets = df.groupby("source")[["tweets"]].count()
topsources_tweets.columns = ["Number of Tweets"]
top10_sources = topsources_tweets.sort_values("Number of Tweets", ascending=False).head(10)
top10 = top10_sources.plot.barh(y="Number of Tweets")
top10_sources

Next, we look at the length of the tweets. Keep in mind that in the previous notebook, *1_Preprocessing_Tweets*, some tweets were removed because of their excessive length. For this reason, the maximum length is under 140.

In [None]:
df["tweet_long"].plot(kind="hist" , bins=20 , figsize=(15,5))
plt.xlabel('tweet_long')
plt.ylabel('Frequency')
plt.title('Histogram of tweet length')
plt.show()

Now, we are going to explore the most frequent locations from which people posted on March 8th and 9th.

In [None]:
toplocations_tweets = df.groupby("user_location")[["tweets"]].count().dropna(how="all")
toplocations_tweets.columns = ["Number of Tweets"]
top10_locations = toplocations_tweets.sort_values("Number of Tweets", ascending=False).head(10)
top10_locations.plot.barh(y="Number of Tweets")
top10_locations

This plot shows that Madrid and Argentina are the places with the most Twitter activity on March 8th and 9th, 2019.

As we can see above, *user_location* is not normalized, because it is a free-text field where users can write the same location in different ways.

For example:

`Madrid, Comunidad de Madrid` is the same as `Madrid`

`Santiago, Chile` can be treated as `Chile`
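One possible way to group such variants is a small lookup table applied after basic cleaning. The sketch below is only an illustration and is not part of the original analysis: `LOCATION_ALIASES` and `normalize_location` are hypothetical names, and the table covers just the two examples above.

```python
# Hypothetical alias table (an assumption): it only mirrors the two examples above.
LOCATION_ALIASES = {
    "madrid, comunidad de madrid": "Madrid",
    "madrid": "Madrid",
    "santiago, chile": "Chile",
    "chile": "Chile",
}

def normalize_location(raw):
    """Map a free-text location to a canonical name; unknown values are only cleaned."""
    if not isinstance(raw, str):
        return None  # missing values such as NaN stay missing
    cleaned = " ".join(raw.split())  # trim and collapse repeated whitespace
    return LOCATION_ALIASES.get(cleaned.lower(), cleaned)

examples = ["Madrid, Comunidad de Madrid", "madrid", "Santiago,  Chile", "Chile", "Buenos Aires"]
normalized = [normalize_location(value) for value in examples]
print(normalized)
```

In the notebook, a function like this could be applied with `df["user_location"].map(normalize_location)` before grouping, so that the variants are counted together. Building a complete alias table for real data would take considerably more work.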

In [None]:
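def delete_accent_mark(word):
    """Remove accent marks by dropping the combining characters (category 'Mn') of the NFD form."""
    return ''.join(c for c in unicodedata.normalize('NFD', word) if unicodedata.category(c) != 'Mn')

def get_hashtags(sentence):
    """Return the hashtags of a tweet, lowercased and without punctuation or accent marks."""
    hashtags = []
    # Every punctuation character except "#", plus Spanish and typographic symbols
    punctuation = string.punctuation.replace("#", "") + "¿¡…“”"
    for word in sentence.split():
        if word.startswith("#"):
            word = word.lower()
            # Raw string avoids invalid escape sequence warnings; the pattern itself is unchanged
            word = re.sub(r'\[.*?¿\]\%', ' ', word)
            word = re.sub('[%s]' % re.escape(punctuation), '', word)
            word = delete_accent_mark(word)
            hashtags.append(word)
    return hashtags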
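To illustrate why the accent removal above works, the following minimal sketch decomposes a sample hashtag (a made-up value, not taken from the dataset) into NFD form and shows that the accent becomes a separate combining character, which can then be filtered out by its Unicode category.

```python
import unicodedata

# Sample hashtag (illustrative only); "\u00ed" is the precomposed letter "í"
word = "#d\u00edadelamujer"

# NFD splits "í" into "i" followed by U+0301 (combining acute accent)
decomposed = unicodedata.normalize("NFD", word)

# Combining marks belong to the Unicode category "Mn" (Mark, nonspacing)
marks = [c for c in decomposed if unicodedata.category(c) == "Mn"]
stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(len(word), len(decomposed), stripped)
```

A side effect worth keeping in mind is that `ñ` is also decomposed into `n` plus a combining tilde, so it ends up as a plain `n` after this step.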

Now, we are going to explore the most common hashtags on March 8th and 9th.

In [None]:
# List of hashtags for each tweet
hashtags = [get_hashtags(tweet) for tweet in df["tweets"]]

# Unique hashtags in order of first appearance
# (get_hashtags already lowercases them and removes accent marks)
hashtags_tweets_general = list(dict.fromkeys(tag for tags in hashtags for tag in tags))

# Same list without the hashtags that start with "#8m"
hashtags_tweets_8m = [tag for tag in hashtags_tweets_general if not tag.startswith("#8m")]

# Space-separated strings used as input for the word clouds
has = " ".join(hashtags_tweets_general)
has_8m = " ".join(hashtags_tweets_8m)
has_8m
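The cell above keeps each hashtag only once, so it does not record how often a hashtag was used. As a complement, a frequency count could be sketched with `collections.Counter`. The tweets below are made-up samples standing in for `df["tweets"]`, and `extract_hashtags` is a simplified, hypothetical helper rather than the `get_hashtags` function defined earlier.

```python
import re
from collections import Counter

# Made-up sample tweets (an assumption) standing in for df["tweets"]
sample_tweets = [
    "Hoy marchamos #8M #HuelgaFeminista",
    "Todas a la calle #8m #DiaDeLaMujer",
    "#HuelgaFeminista en Madrid #8M",
]

def extract_hashtags(text):
    """Simplified helper: lowercase the text and return every '#word' token."""
    return re.findall(r"#\w+", text.lower())

# Count every occurrence instead of keeping only unique hashtags
counts = Counter(tag for tweet in sample_tweets for tag in extract_hashtags(tweet))
top_hashtags = counts.most_common(2)
print(top_hashtags)
```

Counts like these could also be passed to `WordCloud.generate_from_frequencies`, so that the size of each word reflects how often the hashtag appears.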

We will represent the most popular hashtags in a word cloud.

In [None]:
stop = nltk.corpus.stopwords.words("spanish")
wc_8m = WordCloud(width = 800, 
               height = 800, 
               background_color ='white', 
               stopwords = stop, 
               colormap ="Dark2",
               min_font_size = 10,
               max_font_size = 150,
               random_state = 42,
               normalize_plurals = True).generate(has_8m)

# Plot the WordCloud image; the figure size is set before drawing so that it applies to this figure
plt.figure(figsize=(15, 15))
plt.imshow(wc_8m, interpolation='bilinear')
plt.axis("off")
plt.title("8M Hashtags WordCloud")
plt.show()

In [None]:
stop = nltk.corpus.stopwords.words("spanish")
wc = WordCloud(width = 800, 
               height = 800, 
               background_color ='white', 
               stopwords = stop, 
               colormap ="Dark2",
               min_font_size = 10,
               max_font_size = 150,
               random_state = 42,
               normalize_plurals = True).generate(has)

# Plot the WordCloud image; the figure size is set before drawing so that it applies to this figure
plt.figure(figsize=(15, 15))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.title("General 8M Hashtags WordCloud")
plt.show()

Finally, we save both word clouds as .png files:

In [None]:
# File names follow the plot titles used above
wc_8m.to_file("ejecuciones_codigo/8M_hashtags_wordcloud.png")
wc.to_file("ejecuciones_codigo/General_8M_hashtags_wordcloud.png")