**LSE Data Science Institute | DS105M 2022 Week10**

**Topic:** Unstructured Data

**Author:** [@jonjoncardoso](github.com/jonjoncardoso)

**Date:** 29 November 2022

---

Obs: If you did not attend the lecture, you might notice a few gaps in your understanding when following this notebook. Watch the lecture recording.

# Why care about unstructured data?

Most datasets do not come in a tidy format that can fit perfectly well in a data frame (a structured data format). That is the case, for example, of **text data**.

# Working with Twitter Data

We will use [tweepy](https://docs.tweepy.org/en/stable/getting_started.html) library to access Twiter API.

The first thing we have to do is authenticate:

## 🚨🚨🚨🚨KEEP SECRETS OUT! 🚨🚨🚨🚨

EXTREMELY EXTREMELY IMPORTANT ADVICE:

- Don't use your SSH keys ANYWHERE in this notebook. Also, don't put them on Github either!!!
- Instead, create a `config.py` file somewhere outside this project (or .gitignore this file). See for example [this Stackoverflow link](https://stackoverflow.com/a/25501861/843365)

In [149]:
import config # This loads the content of the config.py file. If this throws an error, it is because you haven't created a config.py!

## Establish a connection

In [2]:
import tweepy

client = tweepy.Client(bearer_token=config.bearer_token, 
                       consumer_key=config.api_key, 
                       consumer_secret=config.api_key_secret, 
                       access_token=config.access_token, 
                       access_token_secret=config.access_token_secret)

## Obtain a few tweets just to test this works:

In [3]:
public_tweets = client.search_recent_tweets(query="Qatar")

# What is the format of the data returned?
public_tweets

Response(data=[<Tweet id=1597581927909060608 text='RT @VTVcanal8: #Qatar2022⚽| Aquí te mostramos los resultados del Mundial de Fútbol Qatar 2022, correspondientes a la jornada de este #28Nov…'>, <Tweet id=1597581927648731137 text='RT @DrBrunoGino: Ver o Eduardo Bolsonaro no Qatar hoje me lembrou que existem médicos PILANTRAS que deveriam estar de plantão em UPA do SUS…'>, <Tweet id=1597581926357159937 text='RT @VTVcanal8: #Qatar2022⚽| Estos son los encuentros de la tercera jornada de la Copa Mundial de la FIFA Qatar 2022.\n\nAquí las selecciones…'>, <Tweet id=1597581925157601282 text='netherlands-vs-qatar\nhttps://t.co/3egyyUiCcI\nhttps://t.co/3egyyUiCcI https://t.co/HqTgIye952'>, <Tweet id=1597581924880769028 text='Yyyyy... fue a pasear con tini es normal que haya recorrido todo qatar buscando shoppings de mármol https://t.co/9hrmtO4wEN'>, <Tweet id=1597581924247404544 text='@idextratime @bukalapak Belanda 4-0 Qatar #BukaAjaBukalapak'>, <Tweet id=1597581923995779072 text='@chollosdels

**What kind of data is returned?**

💡 From the [tweepy's Response documentation](https://docs.tweepy.org/en/stable/response.html#tweepy.Response), we read that this object is of a particular data type called [named tuple](https://realpython.com/python-namedtuple/#using-namedtuple-to-write-pythonic-code).  It is kind of a dictionary, there are fields with names that contain data inside them:

In [15]:
public_tweets._fields

('data', 'includes', 'errors', 'meta')

In [16]:
public_tweets.data

[<Tweet id=1597576494720888837 text='“Não pode isso e aquilo no Qatar…”\nMas eles não contavam q o brasileiro tem um coisa mto apaixonante: a alegria! https://t.co/J1T1b5l2Xg'>,
 <Tweet id=1597576494628208640 text='Netherlands wins \nNetherlands 2 - 1 Qatar https://t.co/DUtWhDfIBa'>,
 <Tweet id=1597576494553133056 text='RT @PMU_Sport: 🔮 Les prédictions de @ReveilAudrey &amp; de @NeauMali pour Pays-Bas // Qatar !\n\n🤔 Vous êtes #TeamPaysBas ou #TeamQatar ?\n🔃 RT +…'>,
 <Tweet id=1597576494511173632 text='RT @VTVcanal8: #Qatar2022⚽| Disfruta con nosotros todos los encuentros, análisis, estadísticas, mejores jugadas y mucho más con los mejores…'>,
 <Tweet id=1597576494112727040 text="Vraiment a mes yeux les matchs comme ça y a pas + inutile 😭 On sait tous très bien que le Qatar va ce faire ouvrir en 2 mais on s'emmerde a faire un match 😭 https://t.co/QV4xN7cYab">,
 <Tweet id=1597576493810475009 text='RT @cfootcameroun: «\xa0Je n’ai pas de problème avec mon grand frère Ernest Obama, j’ai é

In [21]:
public_tweets.meta

{'newest_id': '1597576494720888837',
 'oldest_id': '1597576492514410497',
 'result_count': 10,
 'next_token': 'b26v89c19zqg8o3fpzhm60iol0r4ixtrpxfityowxs1h9'}

**Interesting...**

🤔 Hmmm so we learn that our query has returned only 10 results and from the documentation, we read that `next_token` is used to paginate. This applies to all APIS: when in doubt, check the documentation!

Let's look at the next page then:

In [25]:
client.search_recent_tweets(query="Qatar", next_token='b26v89c19zqg8o3fpzhm60iol0r4ixtrpxfityowxs1h9')

Response(data=[<Tweet id=1597576491247624192 text="Migrant workers were deceived and died for Qatar's World Cup. Thousands want compensation https://t.co/8hwISaR45a">, <Tweet id=1597576490723704832 text='RT @binnahar85: Criticising Qatar this way or another is not an issue, it is actually a basic right.\nThe issue here is hypocrisy. Did u do…'>, <Tweet id=1597576490065219585 text='RT @PMU_Sport: 🔮 Les prédictions de @ReveilAudrey &amp; de @NeauMali pour Pays-Bas // Qatar !\n\n🤔 Vous êtes #TeamPaysBas ou #TeamQatar ?\n🔃 RT +…'>, <Tweet id=1597576490061033472 text='RT @_mydeszn: I’m live in Qatar. 🇶🇦❤️✨, and yes I’m actually a white Nigerian https://t.co/pH5WPgtqqs'>, <Tweet id=1597576488987291648 text='RT @HananyaNaftali: The best joke of the World Cup:\n\nQatar wants to lecture Israel on human rights. 😂 #WorldCup2022 #FIFA https://t.co/3pb4…'>, <Tweet id=1597576487745781760 text='RT @RusEmbIran: Moscow has supported the national team of 🇮🇷Iran before the decisive ⚽️football match with 

## What are people talking about LSE on Twitter?

In [9]:
tweet_fields=["id", "text", "attachments", "author_id", "context_annotations", "conversation_id", 
              "created_at", "entities", "in_reply_to_user_id", "lang", "public_metrics"]

In [10]:
lse_tweets = client.search_recent_tweets(query="London School of Economics", 
                                         tweet_fields=tweet_fields,
                                         max_results=100)

**Who tweeted that?**

In [150]:
lse_tweets.data[0].author_id

63684604

In [160]:
client.get_user(id=lse_tweets.data[0].author_id)

Response(data=<User id=63684604 name=Guardian Exec Jobs username=GJ_Exec>, includes={}, errors=[], meta={})

# Let's create a `tidy` dataframe

In [16]:
import pandas as pd

In [72]:
df = pd.concat([pd.DataFrame({field: tweet[field] for field in ["author_id", "text", "created_at", "lang"]}, index=[tweet["id"]]) for tweet in lse_tweets.data])

In [74]:
df

Unnamed: 0,author_id,text,created_at,lang
1597574950415638528,63684604,Director of International Programmes and Impac...,2022-11-29 12:55:24+00:00,en
1597574643489312769,236346971,RT @fascinatorfun: “That amounts to a loss of ...,2022-11-29 12:54:11+00:00,en
1597573860890931200,442976787,The London School of Economics has estimated t...,2022-11-29 12:51:04+00:00,en
1597572812923088897,276442577,RT @fascinatorfun: “That amounts to a loss of ...,2022-11-29 12:46:54+00:00,en
1597571481667768320,188690239,RT @lag_uk: 📢We are pleased to award the 2022 ...,2022-11-29 12:41:37+00:00,en
...,...,...,...,...
1597321824215650304,1159888310891945985,@philerator @phlannelphysics @bleasdale_r @The...,2022-11-28 20:09:34+00:00,en
1597318449336037376,4328654057,[Κλικ και διαβάστε στο ΠΡΩΤΟ ΘΕΜΑ]Δείτε live: ...,2022-11-28 19:56:09+00:00,el
1597317637117464577,1442770309694754821,Στην εκδήλωση του Ελληνικού Παρατηρητηρίου του...,2022-11-28 19:52:56+00:00,el
1597317226960502784,4328654057,[Κλικ και διαβάστε στο ΠΡΩΤΟ ΘΕΜΑ]Δείτε live: ...,2022-11-28 19:51:18+00:00,el


**Let's add the author_id username!**

There is a simpler way to do this, but I want to use this to demonstrate the concept of `merge`

In [156]:
df["author_id"].nunique()

91

In [162]:
all_authors = df["author_id"].unique()
#all_authors

**Use `list comprehension` to obtain author_usernames**

In [161]:
[client.get_user(id=author_id) for author_id in all_authors]

[Response(data=<User id=63684604 name=Guardian Exec Jobs username=GJ_Exec>, includes={}, errors=[], meta={}),
 Response(data=<User id=236346971 name=Bob Massam #FBPE🔶💙 username=bmassam>, includes={}, errors=[], meta={}),
 Response(data=<User id=442976787 name=Ros Chappell 🇺🇦StandWithUkraine ⭐️UK Rejoin 🇪🇺 username=RosChappell>, includes={}, errors=[], meta={}),
 Response(data=<User id=276442577 name=12Pat username=StarterPat>, includes={}, errors=[], meta={}),
 Response(data=<User id=188690239 name=Sam Halvorsen username=samhalvorsen>, includes={}, errors=[], meta={}),
 Response(data=<User id=1496957719982579712 name=M P username=mexxez16>, includes={}, errors=[], meta={}),
 Response(data=<User id=283549521 name=Matthew Aaron Richmond username=mattyrichy>, includes={}, errors=[], meta={}),
 Response(data=<User id=578444253 name=Dominique username=dominiquelevack>, includes={}, errors=[], meta={}),
 Response(data=<User id=283604227 name=Andy Vermaut username=AndyVermaut>, includes={}, e

**That took quite some time... How do we check if my code is stuck?**

💡Use a library called tqdm for progress bar:

In [184]:
import tqdm

author_usernames = [client.get_user(id=author_id) for author_id in tqdm.tqdm(all_authors)]



  0%|                                                                                                                                                               | 0/91 [00:00<?, ?it/s][A[A

TooManyRequests: 429 Too Many Requests
Too Many Requests

**Ok, but we've done the same thing but we did not save it anywhere**

In [174]:
df_authors = pd.DataFrame({"author_id": all_authors,
                           "author_username": author_usernames})

NameError: name 'author_usernames' is not defined

In [173]:
df = pd.merge(df, df_authors, how="left", on=["author_id"])

NameError: name 'df_authors' is not defined

# Extract tokens

Tokenisation is the process of segmenting text into words, punctuations marks etc.

We are going to use spaCy

## Which languages are involved?

In [185]:
df["lang"].value_counts()

en     56
el     39
und     2
de      1
in      1
es      1
Name: lang, dtype: int64

Check twitter API Documentation to check the languages: https://developer.twitter.com/en/docs/twitter-for-websites/supported-languages

## Let's focus on language='en' first

In [216]:
# Pick one random sample
sample = df.query("lang == 'en'").sample(1)

sample

Unnamed: 0,author_id,text,created_at,lang
1597336446532472834,19685010,RT @BrexitBin: In case you missed it ...\nHere...,2022-11-28 21:07:40+00:00,en


In [217]:
just_the_text = sample["text"].values[0]
just_the_text

"RT @BrexitBin: In case you missed it ...\nHere's an excerpt from a report into immigration and wages by the London School of Economics. It s…"

🗣️ **CLASSROOM DISCUSSION:** What does the following represent?

In [218]:
len(just_the_text)

140

In [219]:
type(just_the_text)

str

### Tokenization

**We need to load the language related features from spaCy**

In [220]:
from spacy.lang.en import English
language_parser = English()

tokenized_text = language_parser(just_the_text)
type(tokenized_text)

spacy.tokens.doc.Doc

🗣️ **CLASSROOM DISCUSSION:** What do you think the following represents?

In [221]:
len(tokenized_text)

31

In [222]:
for token in tokenized_text:
    print(token)

RT
@BrexitBin
:
In
case
you
missed
it
...


Here
's
an
excerpt
from
a
report
into
immigration
and
wages
by
the
London
School
of
Economics
.
It
s
…


**Silly way to count repeated tokens using `list comprehension`**

In [223]:
pd.Series([token for token in tokenized_text]).value_counts()

RT             1
report         1
s              1
It             1
.              1
Economics      1
of             1
School         1
London         1
the            1
by             1
wages          1
and            1
immigration    1
into           1
a              1
@BrexitBin     1
from           1
excerpt        1
an             1
's             1
Here           1
\n             1
...            1
it             1
missed         1
you            1
case           1
In             1
:              1
…              1
dtype: int64

**Let's change everything to lowercase:**

In [224]:
new_tokenized_test = language_parser(just_the_text.strip().lower())

pd.Series([token for token in new_tokenized_test]).value_counts()

rt             1
report         1
s              1
it             1
.              1
economics      1
of             1
school         1
london         1
the            1
by             1
wages          1
and            1
immigration    1
into           1
a              1
@brexitbin     1
from           1
excerpt        1
an             1
's             1
here           1
\n             1
...            1
it             1
missed         1
you            1
case           1
in             1
:              1
…              1
dtype: int64

### Lemmatization

https://spacy.io/usage/linguistic-features#lemmatization

There are **pre-trained** fancy NLP models that can detect more interesting things about our text.

In [235]:
# You need to download the suitable NLP models https://spacy.io/models/en
#!python -m spacy download en_core_web_sm

In [236]:
nlp = spacy.load("en_core_web_sm")

In [237]:
fancier_tokenized_text = nlp(just_the_text.strip().lower())

fancier_tokenized_text[6].lemma_

'miss'

In [240]:
my_tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in fancier_tokenized_text ]

### Remove punctuation & stop words

In [242]:
from spacy.lang.en.stop_words import STOP_WORDS

# Create our list of punctuation marks
punctuations = string.punctuation

# Stop words for the English language
stop_words = spacy.lang.en.stop_words.STOP_WORDS

# Removing stop words
my_tokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]
my_tokens

['rt',
 '@lag_uk',
 '📢',
 'pleased',
 'award',
 '2022',
 'latin',
 'american',
 'geographies',
 'research',
 'group',
 'lagrg',
 'undergraduate',
 'dissertation',
 'prize',
 'sophia',
 '…']

In [243]:
len(my_tokens)

17

## Putting everything together: automating this process

Also, check [this tutorial](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/)

In [246]:
import string

from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Create our list of punctuation marks
punctuations = string.punctuation

# Create our list of stopwords
nlp = spacy.load("en_core_web_sm")
stop_words = spacy.lang.en.stop_words.STOP_WORDS

def clean_text(tweet_text):
    simpler_text = tweet_text.strip().lower()
    mytokens = nlp(simpler_text)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]
    return mytokens

Use pandas `apply` to make your code look cleaner!

In [278]:
df_en["text"].apply(clean_text)

1597574950415638528    [director, international, programme, impact, l...
1597574643489312769    [rt, @fascinatorfun, loss, 20, pound, value, d...
1597573860890931200    [london, school, economic, estimate, brexit, –...
1597572812923088897    [rt, @fascinatorfun, loss, 20, pound, value, d...
1597571481667768320    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
1597568409012948997    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
1597565420646518784    [london, school, economic, estimate, brexit, ....
1597561904792502272    [andy, vermaut, share, finance, employee, futu...
1597557869943357440    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
1597557574110310400    [rt, @fascinatorfun, loss, 20, pound, value, d...
1597557330328985601    [rt, @fascinatorfun, loss, 20, pound, value, d...
1597553972101148673    [📢, pleased, award, 2022, latin, american, geo...
1597551218796400640    [rt, @fascinatorfun, loss, 20, pound, value, d...
1597550605513682944    [@danihas03237661, @rachelre

**There is a fancy tqdm for pandas**

In [250]:
from tqdm import tqdm
# from tqdm.auto import tqdm  # for notebooks

# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

In [252]:
df_en["text"].progress_apply(clean_text)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 56/56 [00:00<00:00, 168.24it/s]


1597574950415638528    [director, international, programme, impact, l...
1597574643489312769    [rt, @fascinatorfun, loss, 20, pound, value, d...
1597573860890931200    [london, school, economic, estimate, brexit, –...
1597572812923088897    [rt, @fascinatorfun, loss, 20, pound, value, d...
1597571481667768320    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
1597568409012948997    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
1597565420646518784    [london, school, economic, estimate, brexit, ....
1597561904792502272    [andy, vermaut, share, finance, employee, futu...
1597557869943357440    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
1597557574110310400    [rt, @fascinatorfun, loss, 20, pound, value, d...
1597557330328985601    [rt, @fascinatorfun, loss, 20, pound, value, d...
1597553972101148673    [📢, pleased, award, 2022, latin, american, geo...
1597551218796400640    [rt, @fascinatorfun, loss, 20, pound, value, d...
1597550605513682944    [@danihas03237661, @rachelre

# Now what

Now that you have cleaned and tokenized everything, you can do the fun stuff:

## Bag of Words 

Use `CountVectorizer` from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) package to create a bag of words

In [253]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vector = CountVectorizer(tokenizer = clean_text, ngram_range=(1,1))

In [255]:
bow_vector.fit_transform(df.query("lang == 'en'")["text"])

<56x377 sparse matrix of type '<class 'numpy.int64'>'
	with 945 stored elements in Compressed Sparse Row format>

In [265]:
bow_vector.fit_transform(df.query("lang == 'en'")["text"]).todense().shape

(56, 377)

In [259]:
#bow_vector.get_feature_names_out()

In [261]:
df_bag_words = pd.DataFrame(bow_vector.fit_transform(df.query("lang == 'en'")["text"]).todense(),
                            columns=bow_vector.get_feature_names_out())

**What are the most frequent tokens?**

In [264]:
df_bag_words.sum().sort_values(ascending=False).head(10)

london      52
school      51
economic    47
…           33
rt          33
wage        12
report      11
value       11
pound       11
miss        11
dtype: int64

🗣️ **CLASSROOM DISCUSSION** What would you do with this data now?

### PCA + plotly

In [268]:
!pip install plotly==5.11.0

^C


In [271]:
## Train an algorithm called PCA

## Read more about it here https://scikit-learn.org/stable/modules/decomposition.html#pca

from sklearn.decomposition import PCA

pca = PCA()
components = pca.fit_transform(df_bag_words)

In [304]:
df_pca = pd.DataFrame(components, columns=[f"PC{i+1}" for i in range(components.shape[1])])