# **Intro to text analysis in Python. Day 5**

## *Dr Kirils Makarovs*

## *k.makarovs@exeter.ac.uk*

## *University of Exeter Q-Step Centre*

---


# **Welcome to Day 5!**

## **Today, you are going to work on:**

+ The final text data analysis report

---



# **1. Preparing to work in Python**

In [None]:
# Import the necessary libraries and functions

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

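import spacy
# The 'en' shortcut was removed in spaCy v3; load the small English model by its full name
nlp = spacy.load('en_core_web_sm')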

from sklearn.feature_extraction.text import CountVectorizer

from textblob import TextBlob


---

# **2. Additional resources**

## **2.1 Python Data Science Handbook by Jake VanderPlas**

<figure>
<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/PDSH-cover.png" width="500">
</figure>

*Book is available [here](https://jakevdp.github.io/PythonDataScienceHandbook/).*

*Reproducible notebooks for the handbook are available [here](https://github.com/jakevdp/PythonDataScienceHandbook/tree/master/notebooks).*

## **2.2 Text Analysis in Python for Social Scientists by Dirk Hovy**

<figure>
<img src="https://blackwells.co.uk/jacket/l/9781108819824.jpg" width="500">
</figure>

*Reproducible notebooks for the book are available [here](https://github.com/dirkhovy/text_analysis_for_social_science).*

## **2.3 Introduction to SpaCy course by W. J. B. Mattingly**



*Available [here](http://spacy.pythonhumanities.com/intro.html).*

---

# **3. Exporting Colab notebooks**

+ *Rough-and-ready PDF:* `File -> Print`

(this doesn't always work, and text that extends beyond the page limits may be lost)

+ *Exporting as HTML:* Instructions with screenshots available [here](https://stackoverflow.com/questions/53460051/convert-ipynb-notebook-to-html-in-google-colab).

    + **Step #1:** Save a copy of the notebook in `.ipynb` format onto your machine and give it a different name. For this, use `File -> Download .ipynb`
    + **Step #2:** Upload the saved notebook into the current Colab session by going to `Files` and clicking on `Upload to session storage`
    + **Step #3:** The notebook should now be visible in the root folder of `Files`. Click on the three dots next to it and copy the path to the notebook
    + **Step #4:** In the current notebook, create a new code cell and paste the following piece of code:

      ```
      %%shell
      jupyter nbconvert --to html <notebook_directory.ipynb>
      ```
    + **Step #5:** Replace `<notebook_directory.ipynb>` with the path that you got in **Step #3** and run the code cell
    + **Step #6:** The HTML file should now be visible in the root folder of `Files`. Click on the three dots next to it and download it onto your machine
    + **Step #7:** Find the HTML file on your machine and open it. It should open in your browser
    + **Step #8 (optional):** You can save the HTML file as a PDF by pressing `Ctrl/Command + P` in your browser


In [None]:
# %%shell
# jupyter nbconvert --to html /content/html_day_5.ipynb

[NbConvertApp] Converting notebook /content/html_day_5.ipynb to html
[NbConvertApp] Writing 293516 bytes to /content/html_day_5.html




---

# **4. Data analysis report**




You are going to work with a dataset of song lyrics from two albums by **Radiohead** and **Coldplay**.

The dataset contains the following variables:

+ **band**
+ **album**
+ **year**
+ **song_title**
+ **song_lyrics**
+ **lyrics_views** (proxy for popularity)


In [None]:
# Uploading the dataset with song lyrics into the current Google Colab session

from google.colab import files

uploaded = files.upload()


Saving song_lyrics_csv.csv to song_lyrics_csv.csv


In [None]:
# Let's get the dataset!

df = pd.read_csv('song_lyrics_csv.csv')

df.head(15)

df.tail(15)


| | band | album | year | song_title | song_lyrics | lyrics_views |
|---|---|---|---|---|---|---|
| 31 | Coldplay | A Head Full of Dreams | 2015 | X Marks the Spot | (So I race for it) Stare into darkness, starin... | 39200 |
| 32 | Coldplay | A Head Full of Dreams | 2015 | Amazing Day | We sat on a roof, named every star Shared ever... | 71200 |
| 33 | Coldplay | A Head Full of Dreams | 2015 | Colour Spectrum | Because each Has been sent As a guide | 26100 |
| 34 | Coldplay | A Head Full of Dreams | 2015 | Up and Up | Fixing up a car, driving it again Searching fo... | 170700 |
| 35 | Coldplay | Parachutes | 2000 | Don't Panic | Bones, sinking like stones All that we fought ... | 93800 |
| 36 | Coldplay | Parachutes | 2000 | Shiver | So I look in your direction But you pay me no ... | 105900 |
| 37 | Coldplay | Parachutes | 2000 | Spies | I awake to find no peace of mind I said, "How ... | 48600 |
| 38 | Coldplay | Parachutes | 2000 | Sparks | Did I drive you away? I know what you'll say Y... | 378800 |
| 39 | Coldplay | Parachutes | 2000 | Yellow | Look at the stars, look how they shine for you... | 1200000 |
| 40 | Coldplay | Parachutes | 2000 | Trouble | Oh no, I see A spider web is tangled up with m... | 165400 |


In [None]:
df.shape # 46 songs and 6 variables

df.columns # names of the variables

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   band          46 non-null     object
 1   album         46 non-null     object
 2   year          46 non-null     int64 
 3   song_title    46 non-null     object
 4   song_lyrics   46 non-null     object
 5   lyrics_views  46 non-null     int64 
dtypes: int64(2), object(4)
memory usage: 2.3+ KB


In [None]:
# Get the first 10 rows of the dataset

df.head(10)
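
Before settling on a research question, it may help to summarise the numeric variables by group. The sketch below is only an illustration: `sample` is a hypothetical stand-in for `df`, built from four of the rows shown in the `df.tail(15)` output (lyrics column omitted), and the names `summary`, `n_songs` and `median_views` are assumptions rather than part of the course materials.

```python
import pandas as pd

# Hypothetical stand-in for df: four rows copied from the df.tail(15) output
sample = pd.DataFrame({
    'band': ['Coldplay'] * 4,
    'album': ['A Head Full of Dreams', 'A Head Full of Dreams',
              'Parachutes', 'Parachutes'],
    'year': [2015, 2015, 2000, 2000],
    'song_title': ['Amazing Day', 'Up and Up', 'Sparks', 'Yellow'],
    'lyrics_views': [71200, 170700, 378800, 1200000],
})

# Count the songs and take the median number of lyrics views per album
summary = (
    sample.groupby(['band', 'album'])['lyrics_views']
    .agg(n_songs='count', median_views='median')
    .reset_index()
)

print(summary)
```

On the full dataset, the same pattern should work by replacing `sample` with `df`. The median is used here because a few very popular songs (such as *Yellow*) can pull the mean upwards.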


---

# **5. Your task**

*For the final text data analysis report, please:*

1. form groups of 2-3 people
2. come up with a small research question that can be answered with the help of the provided dataset (we will discuss some examples in class)
3. produce 2-5 pieces of output (tables/graphs) that address your research question
4. interpret the tables and graphs that you have obtained (you don't have to write any text!)
5. send your Jupyter notebook to k.makarovs@exeter.ac.uk as a **single HTML file**
6. present your findings to the class and let's discuss them!






# **That's the end of Day 5!**