# Lab 2a Analyzing Bokmässan Tweets

To learn Python programming in more detail (if you need to), you can also study chapters 1-9 of the free online book [Python for Everybody](https://www.py4e.com/html3/). However, this is not essential for completing the labs.

Now that you are familiar with Jupyter, we can try out a simple analysis of Twitter data with a focus on data cleansing.

Firstly, we need to load the relevant Python modules that we will use in the code later by running the next code cell. In this cell, we use the `import` command to load five modules:

1. `pandas` - A collection of utilities to load and manipulate data tables.
1. `textmining` - Functions for statistical text mining, focused on the bag-of-words model.
1. `wordcloud` - A visualization module that generates wordclouds.
1. `matplotlib` -  A 2D plotting library to create figures such as charts and plots.
1. `sklearn` -  Scikit-learn, a library of machine learning algorithms.

In [None]:
# Run this cell to import the modules and set up some stuff
import pandas as pd
import textmining as tm
import wordcloud
import matplotlib.pyplot  # importing the submodule makes matplotlib.pyplot usable below
from sklearn.feature_extraction.text import CountVectorizer
# the following two lines set up our visualization settings
%matplotlib inline
matplotlib.pyplot.rcParams['figure.figsize'] = [10, 6]

## Analysis of Twitter data from the book fair

You have been hired as consultants by a book publisher who wants you to find out which themes and books have generated attention on Twitter during the 2016 book fair in Gothenburg.

Your task is to find out whether any topic was particularly hot on Twitter before and during the book fair, and to present a proposal to the company on which themes seem to create debate. In this lab we focus on data preparation. In order to prepare data, it is important to understand the data.

**Q1.** What do you think is distinctive about Twitter data, and how will this affect how we might want to pre-process the data?

Provide your answer by editing the cell below:

*Edit this cell to type your answer here*

*Hint: Double-click on this cell to edit it, then to exit edit-mode run the cell using the `Run` button in the toolbar above or press shift-return*.

## Data pre-processing

Often, the data to be analyzed must be cleansed before we can use it. Data cleansing can include tasks such as dealing with missing values or, as in our case, filtering out some parts of the raw text data. The data you have been provided with was collected from Twitter from May 2016 to September 2016, before and during the "Book Fair" event.

The data can be found in the `lab1-data.tsv` file.

### Loading data

Start by loading data from a `.tsv` file. We can load this data from the Cloud using a URL that points to the file. A TSV (tab-separated values) file is a text file where each line corresponds to a row of a table, and the cells in every row are delimited with a tab character.

You can load data into a `DataFrame` object using `pandas` with the following command:

In [None]:
bok_tweets = pd.read_csv(
    "https://s3.eu-west-2.amazonaws.com/uu-datamining-assets/lab1-data.tsv", 
    encoding="utf-8", sep="\t"
)
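
To see what the tab separator does on a small scale, here is a minimal sketch that reads a TSV from an in-memory string instead of a URL. The two-row sample and its column names (`id`, `text`) are made up for illustration and are not taken from the lab file.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real lab file has its own columns and values.
sample_tsv = "id\ttext\n1\tfirst tweet\n2\tsecond tweet\n"

# sep="\t" tells pandas to split each line on tab characters.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample_df.shape)
print(list(sample_df.columns))
```

The first line of the text becomes the column names, and each following line becomes one row.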

In Jupyter you can inspect variable `bok_tweets` by executing a cell with the variable name, as follows:

In [None]:
bok_tweets

To inspect the column names, use the `columns` attribute.

**Q2.** Replace the ellipsis `...` with `bok_tweets.columns` to assign the column list to the variable named `columns` and run the cell:

In [None]:
columns = ...
columns

To view the first few rows of a dataset, use the `head()` function with `3` as a parameter to tell the function to return only the first 3 records from `bok_tweets`. You can of course replace this with other numbers (try it):

In [None]:
bok_tweets.head(3)

To get some summary statistics on the dataset, use the `describe()` function:

In [None]:
bok_tweets.describe()

To get the shape of the dataset (length and width), use the `shape` attribute:

In [None]:
bok_tweets.shape

**Q3.** How many rows and columns are in the dataset? Provide your answer by replacing the ellipses `...` in the next cell with your answers.

In [None]:
number_of_rows = ...
number_of_columns = ...

To get general information about the data set, such as how many values are not empty, use the `info()` function:

In [None]:
bok_tweets.info()

**Q4.** How many tweets in the dataset are mentions of another user? (A mention is when you include somebody's `@twittername` in the tweet.)

*Hint: The count of non-null objects reported by `info()` tells you how many values are present in a particular column.*

In [None]:
number_of_mentions = ...
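
As a small illustration of what a non-null count means, the sketch below builds a toy table with missing values. The table and its `mention` column are hypothetical and do not reflect the actual columns of `bok_tweets`.

```python
import pandas as pd

# Hypothetical toy table: None marks a missing value.
toy = pd.DataFrame({
    "text": ["hello", "hi @anna", "bye"],
    "mention": [None, "anna", None],
})

# count() returns the number of non-null values in each column,
# the same figure that info() reports as "non-null".
non_null_counts = toy.count()
print(non_null_counts)
```

A column with fewer non-null values than there are rows contains missing entries for the remaining rows.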

**Q5.** Inspect the columns and contents of data. What part of data may be of interest for your analysis?

*Edit this cell to type your answer here*

A collection of documents containing text is usually called a corpus. We can create a corpus by extracting just the `text` column of `bok_tweets`. Pandas lets us do this by simply accessing the column by name with the dot accessor.

Run the next cell to extract the text from the data and create the corpus that you will work with:

In [None]:
tweets_corpus = bok_tweets.text
tweets_corpus

### Emojis

On Twitter it is common to use emojis 👍 ✨ 🐫 🎉 🚀 🤘.

When doing text analysis, emojis can be useful because an emoji can carry a lot of information about what the author means and what tone the text has. However, emojis may be problematic during analysis since their encoding is not necessarily compatible with processing modules such as NLTK.

Unfortunately, emojis create problems for the features we use in this lab 😭 and you will therefore need to filter out emojis from the raw data.

Run the following cell, which removes emojis from the `tweets_corpus` corpus:

In [None]:
encode2ascii = lambda x: x.encode('ascii', errors='ignore').decode('utf-8')
clean_tweets = tweets_corpus.apply(encode2ascii)
clean_tweets
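
To make the effect of this step concrete, here is a minimal sketch on a single made-up string. It uses the same encode-and-ignore approach as the cell above, applied to one value rather than a whole corpus.

```python
# Made-up sample text containing an accented letter and an emoji.
sample = "Bokmässan 🎉 2016"

# Characters that cannot be represented in ASCII are silently dropped.
ascii_only = sample.encode("ascii", errors="ignore").decode("ascii")

print(repr(ascii_only))
```

Note that this drops every non-ASCII character, not only emojis, so the `ä` in the sample disappears as well.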

**Q6.** How might removing emojis affect the quality of analysis? Explain your answer.

*Edit this cell to type your answer here*

### Remove URLs
On Twitter, it is common to link to locations on the Web using URLs. Commonly occurring parts of URLs will often end up among the most frequent words, so it is important to filter them out.

We can remove content that matches URL patterns with the following command:

In [None]:
clean_tweets = clean_tweets.str.replace(r'http\S+', '', regex=True)  # regex=True: treat the pattern as a regular expression
clean_tweets
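
The pattern `http\S+` matches the text `http` followed by every character up to the next whitespace. A minimal sketch with the standard `re` module on a made-up tweet shows what is removed:

```python
import re

# Made-up tweet text with a placeholder URL.
sample = "See you at the fair https://example.com/abc #bokmassan"

# Remove everything from "http" up to the next whitespace character.
no_urls = re.sub(r"http\S+", "", sample)

# Optional tidy-up: collapse the double space left behind by the removal.
tidy = " ".join(no_urls.split())

print(repr(no_urls))
print(repr(tidy))
```

The removal leaves the surrounding spaces in place, which is harmless for word counting because the text is later split on whitespace.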

**Q7.** How might removing URLs affect the quality of analysis? Explain your answer.

*Edit this cell to type your answer here*

### Function for most frequent words

We will look for the most frequent words after each pre-processing step in this lab in order to compare the effect of the pre-processing. Because we will repeat the same operations several times, we will create a couple of functions to help us with our analysis.

#### What is a Term Document Matrix?

First, we create a term-document matrix (TDM), which can also be referred to as a document-term matrix (DTM). A TDM gives us a table of the number of instances of each word for each document in a corpus. You should start by creating a TDM that represents each tweet as a feature vector. The feature vector has one element for each word that occurs in the corpus (unless the word is excluded in the pre-processing, see further below). In the TDM you create, each row corresponds to the text of one tweet, and each element in that row holds the number of times the corresponding word occurs in the tweet.

Our function `create_term_document_matrix()` to create a TDM is as follows:

In [None]:
def create_term_document_matrix(corpus, min_df=1):
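    # Recent scikit-learn versions expect stop_words as a list
    cvec = CountVectorizer(min_df=min_df, stop_words=list(tm.stopwords))
    tfmatrix = cvec.fit_transform(corpus)
    # get_feature_names_out() requires scikit-learn 1.0 or later
    return pd.DataFrame(data=tfmatrix.toarray(), columns=cvec.get_feature_names_out())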
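
To illustrate the idea of a TDM without any libraries, here is a minimal sketch using only the standard library. It is a simplification and not what `CountVectorizer` does internally: the tokenization is a plain whitespace split, no stop words are removed, and the three documents are made up.

```python
from collections import Counter

# Made-up mini corpus of three documents.
docs = ["books are great", "great fair great books", "see you there"]

# Simplified tokenization: lowercase and split on whitespace.
counts = [Counter(doc.lower().split()) for doc in docs]

# The vocabulary is every distinct word in the corpus, sorted alphabetically.
vocabulary = sorted(set(word for c in counts for word in c))

# One row per document, one column per vocabulary word.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has as many elements as there are distinct words in the whole corpus, so most entries are zero for short documents such as tweets.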

For example, we can create a TDM for only the first three documents in our tweets corpus by using the `.head(3)` function on the `tweets_corpus`, similar to what we did at the beginning with `bok_tweets`.

Replace the ellipsis in the next cell with the documents we wish to pass to the `create_term_document_matrix()` function:

In [None]:
create_term_document_matrix( ... )

**Q8.** How many columns are created for our small TDM above?

In [None]:
number_of_tdm_columns = ...

To find the top words we will do a bit more work with our next function `plot_top_words()`. In this function we sum up each of the columns in the TDM, sort the word frequencies by counts, return the top sorted words list, and additionally plot these words in a nice bar chart.

In [None]:
def plot_top_words(tweets, num_word_instances, top_words):
    tdm_df = create_term_document_matrix(tweets, min_df=2)
    word_frequencies = tdm_df[[x for x in tdm_df.columns if len(x) > 1]].sum()
    sorted_words = word_frequencies.sort_values(ascending=False)
    top_sorted_words = sorted_words[:num_word_instances]
    top_sorted_words[:top_words].plot.bar()
    return top_sorted_words
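
The summing and sorting steps can be tried in isolation on a tiny table. The sketch below uses a made-up TDM with three words and is only meant to show what the column sums and the sort produce.

```python
import pandas as pd

# Hypothetical toy TDM: rows are documents, columns are words.
toy_tdm = pd.DataFrame({
    "books": [1, 1, 0],
    "fair": [0, 1, 0],
    "great": [1, 2, 0],
})

totals = toy_tdm.sum()                         # column sums = counts over the whole corpus
ranked = totals.sort_values(ascending=False)   # most frequent word first
top_two = ranked[:2]                           # keep only the first two entries

print(top_two)
```

The result is a `Series` whose index holds the words and whose values hold the counts, which is why the later cells use `.index` to list the words.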

After defining our own `plot_top_words()` function, we can use it by passing the tweets corpus as input (be patient, as it may take some time for the function to complete processing):

In [None]:
top_words = plot_top_words(clean_tweets, 50, 30)
top_words

**Q9.** How many times must a word occur in your corpus for the word to appear in the top words list output above?

In [None]:
min_occurences_to_make_top_50_words = ...

**Q10.** How many words does the function plot in the bar chart?

In [None]:
number_of_words_plotted_in_bar_chart = ...

### Lowercase

The next step is to convert all the words to lowercase letters. You do this to avoid identifying the same word as different words when it is written in different cases. For example, before transforming the whole corpus into lowercase letters, the words `Bokmaessan` and `bokmaessan` may be identified as different words.

To change the `clean_tweets` corpus into lower case text, use the following command:

In [None]:
tweets_lowered = clean_tweets.str.lower()
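
The effect of lowercasing on word counts can be sketched with a few made-up tokens and the standard library `Counter`:

```python
from collections import Counter

# Made-up tokens: the same word written in three different cases.
tokens = ["Bokmaessan", "bokmaessan", "BOKMAESSAN", "fair"]

before = Counter(tokens)                    # case-sensitive counts
after = Counter(t.lower() for t in tokens)  # counts after lowercasing

print(before)
print(after)
```

Before lowercasing, each spelling is counted separately; afterwards the three variants are merged into a single entry with a higher count.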

In the next cell, write some code to plot the top words again with the lowered tweets corpus. Replace the ellipsis (`...`) with your own code:

In [None]:
top_words_lowered = ...
top_words_lowered

**Q11.** What do you observe in the data after plotting the lowered tweets, and why?

*Edit this cell to type your answer here*

In the next cell, you can use the code to compare your different lists of the most common words. The code creates a table using the `DataFrame` class, with the indexes of both top words lists as inputs.

This code lets us preview the top 20 words, where the range given in the square brackets `[0:20]` defines which part of the top words lists is used. Positions are counted from zero and the end position is excluded, so `[5:40]` will give you the 6th to 40th words in the list. You can try changing the range values if you like.

In [None]:
pd.DataFrame({
    'Top tweeted clean': top_words[0:20].index,
    'Top tweeted lowered': top_words_lowered[0:20].index
})
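
Slice ranges in Python start counting at zero and exclude the end position. A short sketch with a made-up list shows which elements a range selects:

```python
# Made-up ranked list; "w1" is the most frequent word, "w2" the second, and so on.
words = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8"]

first_three = words[0:3]   # positions 0, 1, 2 -> the 1st to 3rd words
sixth_on = words[5:7]      # positions 5, 6    -> the 6th and 7th words

print(first_three)
print(sixth_on)
```

The same rule applies when slicing the top words lists, so a range `[a:b]` always returns `b - a` entries if the list is long enough.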

You can use the following code to check whether the words in the two top words lists are identical:

In [None]:
set(top_words[0:20].index) - set(top_words_lowered[0:20].index)

If the two lists contain the same words, the cell will return only `set()`; otherwise it will list the words from the first list that are missing from the second.
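
Set subtraction only looks in one direction. The following sketch with two made-up word sets shows the difference between subtracting in each direction and the symmetric difference operator `^`, which reports the words unique to either side:

```python
# Two made-up sets of top words that differ in one word each.
first = {"bok", "fair", "author"}
second = {"bok", "fair", "prize"}

only_in_first = first - second    # words in first but not in second
only_in_second = second - first   # words in second but not in first
in_either_only = first ^ second   # words that appear in exactly one of the sets

print(only_in_first)
print(only_in_second)
print(in_either_only)
```

If you want to see the words unique to the lowered list as well, you can swap the order of the two sets or use `^`.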

**Q12.** Has the pre-processing of data you performed so far changed the list of the 20 most frequent words? Provide a reason for your observation.

*Edit this cell to type your answer here*

### Small words

Most small words are usually of limited importance, so let's strip those out. We can use a regular expression to find the words that are at least 3 characters long and keep only those in the corpus.

In [None]:
tweets_low_no_small = tweets_lowered.str.findall(r'\w{3,}').str.join(' ')  # keep only words of 3+ characters
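
The pattern `\w{3,}` matches runs of three or more word characters (letters, digits and underscores). A minimal sketch with the standard `re` module on a made-up sentence shows which words survive:

```python
import re

# Made-up lowercase text with words of different lengths.
sample = "vi ses i monter b05 on friday"

# Find every run of at least three word characters and join them back together.
kept = " ".join(re.findall(r"\w{3,}", sample))

print(kept)
```

Words of one or two characters are dropped, while tokens such as `b05` are kept because digits also count as word characters.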

In [None]:
top_words_low_no_small = plot_top_words(tweets_low_no_small, 50, 30)
top_words_low_no_small

**Q13.** Now after removing short words, how many times must a word occur in your corpus for the word to appear in the top words list output above?

In [None]:
min_occurences_to_make_top_50_words_short_words = ...

### Stop Words

Stop words are words of limited importance and are therefore not so interesting for your analysis. We use stop words as a reference so that we can filter out words that we do not want to analyze, for example prepositions and conjunctions.

First, we create a list of stop words that we want to remove from the corpus before counting the most frequent words:

In [None]:
my_stop_words = ["och", "det", "att", "i", "en", "jag", "hon", 
                "som", "han", "paa", "den", "med", "var", "sig", 
                "foer", "saa", "till", "aer", "men", "ett", 
                "om", "hade", "de", "av", "icke", "mig", "du", 
                "henne", "daa", "sin", "nu", "har", "inte", 
                "hans", "honom", "skulle", "hennes", "daer", 
                "min", "man", "ej", "vid", "kunde", "naagot", 
                "fraan", "ut", "naer", "efter", "upp", "vi", 
                "dem", "vara", "vad", "oever", "aen", "dig", 
                "kan", "sina", "haer", "ha", "mot", "alla", 
                "under", "naagon", "eller", "allt", "mycket", 
                "sedan", "ju", "denna", "sjaelv", "detta", 
                "aat", "utan", "varit", "hur", "ingen", "mitt", 
                "ni", "bli", "blev", "oss", "din", "dessa", 
                "naagra", "deras", "blir", "mina", "samma", 
                "vilken", "er", "saadan", "vaar", "blivit", 
                "dess", "inom", "mellan", "saadant", "varfoer", 
                "varje", "vilka", "ditt", "vem", "vilket", 
                "sitta", "saadana", "vart", "dina", "vars", 
                "vaart", "vaara", "ert", "era", "vilka"]

Then we can define a function `remove_stopwords` that removes the stop words from a single document. We then use the `.apply()` function to apply the function across the whole of the `tweets_low_no_small` corpus to delete these words:

In [None]:
remove_stopwords = lambda x: ' '.join(y for y in x.split() if y not in my_stop_words)
tweets_low_no_small_stopwords = tweets_low_no_small.apply(remove_stopwords)
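
The same filtering idea can be sketched on one made-up sentence. The helper name `remove_stop_words` and the three-word stop list below are illustrative only; the sketch stores the stop words in a `set`, which is one possible design choice because membership tests on a set do not slow down as the list grows.

```python
# Illustrative stop-word set (a small subset chosen for this example).
stop_words = {"och", "det", "att"}

def remove_stop_words(text, stop_words):
    """Return text with every whitespace-separated word found in stop_words removed."""
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = remove_stop_words("boken och filmen att se", stop_words)
print(cleaned)
```

Only exact matches are removed, so the text should already be lowercased if the stop words are written in lowercase.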

In [None]:
top_words_low_no_small_stopwords = plot_top_words(tweets_low_no_small_stopwords, 50, 30)
top_words_low_no_small_stopwords

**Q14.** Now after removing stop words, how many times must a word occur in your corpus for the word to appear in the top words list output above?

In [None]:
min_occurences_to_make_top_50_words_short_stop_words = ...

**Q15.** What are the differences between the most frequent words compared to the previous pre-processed lists?

*Hint: To help with your answer, read a little further to create a comparison table.*

*Edit this cell to type your answer here*

Write some code to create a table showing the top 20 words at each stage of pre-processing by comparing `top_words_lowered`, `top_words_low_no_small` and `top_words_low_no_small_stopwords` to help you answer this question.

In the code cell below replace the ellipsis with your own code and run it to create your comparison table:

In [None]:
...

### Add your own stopwords

Now you can choose to add your own stop words if you think there are words in the graph that are not very informative for determining what kinds of topics were discussed at the book fair. For example, you could remove the word for `years`, which is represented in the text as `aar`. Write your own code in the cell below and run it to remove your own stop words:

*Check the earlier example code that removes the initial list of stop words if you are not sure how to do this.*

In [None]:
more_stop_words = [ ... ]

remove_more_stopwords = lambda x: ' '.join(y for y in x.split() if y not in more_stop_words)
tweets_low_no_small_more_stopwords = tweets_low_no_small_stopwords.apply(remove_more_stopwords)
top_words_no_small_more_stopwords = plot_top_words(tweets_low_no_small_more_stopwords, 50, 30)
top_words_no_small_more_stopwords

**Q16.** What stop words did you add and why? Did you notice any further problems?

*Edit this cell to type your answer here*

### Visualization of analysis and recommendation

Now you will create a visualization that will help you convince the company why they should focus on this particular topic. A common way of visualizing commonly used words in a text is by using a word cloud.

You create a word cloud using the following code:

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud(max_font_size=40)
wordcloud.fit_words(top_words_no_small_more_stopwords.to_dict())
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

The code above creates a word cloud with words from the `top_words_no_small_more_stopwords` list. Run the next cell to generate a word cloud with the `top_words_low_no_small_stopwords` list.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud(max_font_size=40)
wordcloud.fit_words(top_words_low_no_small_stopwords.to_dict())
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

### Compare your word clouds

Create word clouds for at least two of your top words lists to compare how the pre-processing has affected the word clouds. You can also change the minimum frequency for a word to end up in the word cloud. If you think any words should be deleted, you can go back to an earlier step, add them to your stop word list and re-run the cells afterwards.

**Q17.** Are there any words that are not as informative that you removed to improve visualization? Explain why you removed any additional words.

*Edit this cell to type your answer here*

**Q18.** What theme would you recommend the book publisher to target next year? Explain your answer.

*Edit this cell to type your answer here*

**Q19.** Now that you have explored some Twitter data, what do you now think are the interesting characteristics of this kind of data? How does it affect how you must pre-process data?

*Edit this cell to type your answer here*

---
You're done with Lab 2a!

Choose **Save and Checkpoint** from the **File** menu to save your work.

If you are running the labs in Binder (on the cloud), then choose **Download as Notebook** and save it to your computer. 

You can then move on to [Lab 2b](Lab2b_Association_analysis_for_MatFörAlla.ipynb).