# @robotodio - tweet extractor

---

## Libraries

In [74]:
# Python libraries
# ------------------------------------------------------------------------------
# Reading files with different formats
import json

# Data wrangling
import re
import pandas as pd
import numpy as np

# Twitter API
import tweepy

# NLP
from textblob import TextBlob

# Hate speech detection
from detoxify import Detoxify

## Download data from Twitter

The `tweepy.API` class provides access to the methods of the Twitter REST API. Each method accepts various parameters and returns a response.

### API set up

In [2]:
# API Twitter credentials
# ------------------------------------------------------------------------------
# Open .json file containing credentials/tokens as a dictionary
with open("twitter_api_keys.json") as file:
    api_credentials = json.load(file)
    
# Assign each value of the dictionary to a new variable
consumer_key = api_credentials['consumer_key']
consumer_secret = api_credentials['consumer_secret']
access_token = api_credentials['access_token']
access_token_secret = api_credentials['access_token_secret']
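
The credentials file is expected to be a flat JSON object holding the four keys read above. As a minimal sketch, a loader could fail early when one of them is missing. The helper name `load_credentials` and the placeholder values are hypothetical, not part of the original notebook:

```python
import json

REQUIRED_KEYS = ("consumer_key", "consumer_secret",
                 "access_token", "access_token_secret")


def load_credentials(raw_json):
    """Parse a JSON string and check that every required key is present."""
    credentials = json.loads(raw_json)
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"Missing credentials: {', '.join(missing)}")
    return credentials


# Placeholder content mirroring the assumed layout of twitter_api_keys.json
sample = json.dumps({key: "xxxx" for key in REQUIRED_KEYS})
credentials = load_credentials(sample)
print(sorted(credentials))
```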

In [3]:
# API set up
# ------------------------------------------------------------------------------
# Create a handler instance with key and secret consumer, and pass the tokens
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
    
# Construct the API instance
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

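# Check credentials (the method must be called: the bare method object is
# always truthy, so the original check could never fail)
if api.verify_credentials():
    print("Logged In Successfully")
else:
    print("Error -- Could not log in with your credentials")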

Logged In Successfully


### Extract tweets and associated metadata

Once the `tweepy.API` class has been instantiated, we can call its methods to retrieve the desired information from Twitter.

Pagination comes up constantly when working with the Twitter API: iterating through timelines, user lists, direct messages, and so on. To paginate, we must supply a page or cursor parameter with each request, which requires a lot of boilerplate code just to manage the pagination loop. To make pagination easier and reduce the amount of code, Tweepy provides the `Cursor` object.
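
To see the kind of loop that `Cursor` saves us from writing, here is a rough sketch of manual pagination. `fetch_page` is a hypothetical stand-in for one API request, not a Tweepy function. The code only illustrates the general idea and makes no claim about how Tweepy implements it internally:

```python
def fetch_page(cursor, page_size=3):
    """Hypothetical stand-in for one API request: returns (items, next_cursor)."""
    data = list(range(10))  # pretend these are tweets stored on the server
    items = data[cursor:cursor + page_size]
    next_cursor = cursor + page_size if cursor + page_size < len(data) else None
    return items, next_cursor


def iterate_items(limit):
    """Yield up to `limit` items, requesting a new page only when needed."""
    cursor, yielded = 0, 0
    while cursor is not None and yielded < limit:
        items, cursor = fetch_page(cursor)
        for item in items:
            if yielded == limit:
                return
            yield item
            yielded += 1


print(list(iterate_items(5)))
```

The caller only states how many items it wants. The page bookkeeping stays inside the generator, which is the convenience `Cursor(...).items(n)` offers.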

First, we define a function that returns an iterator of tweets. We will call this function each time we want to download tweets from a user account.

In [4]:
# Tweets list iterator
# ------------------------------------------------------------------------------
def tweets_iterator(target, n_items):
    '''
    Returns an iterator of tweets.

        Parameters:
            target (str): The user name of the Twitter account.
            n_items (int): Number of tweets downloaded.

        Returns:
            tweets (ItemIterator): an iterator of tweets
    '''
    # Instantiate the iterator
    tweets = tweepy.Cursor(
        api.user_timeline,
        screen_name=target,
        include_rts=False,
        exclude_replies=False,
        tweet_mode='extended').items(n_items)
    
    # Returns iterator
    return tweets

The next cell shows an example of all the metadata that a single tweet contains.

In [80]:
# Introduce the target Twitter account and number of items to download
target = 'DFD_74'
n_items = 20

# Tweets list (iterator)
tweets = tweets_iterator(target, n_items)

for k, tweet in enumerate(tweets):
    print(json.dumps(tweet._json, indent=4))
    if k == 1:
        break

{
    "created_at": "Sun Feb 14 14:51:46 +0000 2021",
    "id": 1360964958834614273,
    "id_str": "1360964958834614273",
    "full_text": "Cuando estaba en el hospital muy mal con el covid solo pensaba en pasar este d\u00eda con mi se\u00f1ora y mis hijos. Hoy se ha cumplido \nGracias vida por seguir ah\u00ed \ud83d\udcaa",
    "truncated": false,
    "display_text_range": [
        0,
        157
    ],
    "entities": {
        "hashtags": [],
        "symbols": [],
        "user_mentions": [],
        "urls": []
    },
    "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 908629539609341952,
        "id_str": "908629539609341952",
        "name": "Espa\u00f1ol74 (david) \ud83c\uddea\ud83c\uddf8",
        "screen_name": "DFD_7

As the previous output cell shows, each tweet contains a lot of metadata. All this information comes in JSON format, and we can access it either through the attributes of the Tweepy objects or with dictionary methods.
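
As a small illustration of the dictionary route, the sketch below works on an abridged copy of the JSON printed above. The variable names are illustrative only, and the regex used to strip the HTML anchor is just one simple option:

```python
import re

# Abridged copy of the JSON shown above, reduced to a few fields
tweet_json = {
    "created_at": "Sun Feb 14 14:51:46 +0000 2021",
    "id": 1360964958834614273,
    "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": []},
    "source": '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    "user": {"id": 908629539609341952, "id_str": "908629539609341952"},
}

# Nested fields are reached by chaining keys
user_id = tweet_json["user"]["id"]
hashtags = tweet_json["entities"]["hashtags"]

# .get() returns None instead of raising KeyError when a field is absent,
# as is the case for this field in the abridged copy
reply_to = tweet_json.get("in_reply_to_screen_name")

# The raw "source" field is an HTML anchor; keep only its visible text
source_name = re.sub(r"<[^>]+>", "", tweet_json["source"])

print(user_id, hashtags, reply_to, source_name)
```

With the Tweepy objects themselves, the same values are available as attributes, for example `tweet.user.screen_name` or `tweet.entities['hashtags']`.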

In the next cell we extract some metadata from each tweet and export it to a pandas DataFrame.

In [155]:
# Tweet extractor
# ------------------------------------------------------------------------------
# Tweets list (iterator)
tweets = tweets_iterator(target, n_items)

# Read through the iterator and collect one row of values per tweet
all_columns = [np.array([
    tweet.full_text,
    tweet.user.screen_name,
    tweet.id,
    tweet.source,
    tweet.created_at,
    len(tweet.full_text),
    tweet.favorite_count,
    tweet.retweet_count,
    str(tweet.entities['hashtags'])
]) for tweet in tweets]

# Export the list of tweets to a DataFrame. The column names must follow the
# same order as the values above: screen name ('account') comes before 'id'.
df = pd.DataFrame(
    data=all_columns,
    columns=['tweet', 'account', 'id', 'source', 'date', 'length', 'likes',
             'RTs', 'hashtags']
)

df.head()

| | tweet | account | id | source | date | length | likes | RTs | hashtags |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Cuando estaba en el hospital muy mal con el co... | DFD_74 | 1360964958834614273 | Twitter for Android | 2021-02-14 14:51:46 | 157 | 18 | 2 | [] |
| 1 | En el parlamento catalán están los golpistas d... | DFD_74 | 1360960796679954437 | Twitter for Android | 2021-02-14 14:35:14 | 215 | 1 | 0 | [] |
| 2 | @Chaskanaui1 Vamos mejorando. Despacio pero firme | DFD_74 | 1360960263181238278 | Twitter for Android | 2021-02-14 14:33:06 | 49 | 1 | 0 | [] |
| 3 | @pigdemont_ @pardodevera Es una doña nadie. Va... | DFD_74 | 1360936343824723968 | Twitter for Android | 2021-02-14 12:58:03 | 75 | 3 | 0 | [] |
| 4 | Hay gente asquerosa\nHay gente miserable \nY l... | DFD_74 | 1360935288235847680 | Twitter for Android | 2021-02-14 12:53:52 | 74 | 59 | 19 | [] |
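
When a DataFrame is built from positional rows, the values and the `columns` list have to be kept in the same order by hand. One possible alternative is to build each row as a dictionary, so that every value travels with its column name. The sketch below uses hypothetical stand-in objects (`fake_tweets`) instead of real Tweepy results:

```python
from types import SimpleNamespace

# Hypothetical stand-in objects exposing a few of the attributes used above
fake_tweets = [
    SimpleNamespace(full_text="first tweet", id=1, favorite_count=3,
                    user=SimpleNamespace(screen_name="some_account")),
    SimpleNamespace(full_text="second one", id=2, favorite_count=0,
                    user=SimpleNamespace(screen_name="some_account")),
]

# Each value is stored next to its column name, so the order cannot drift
rows = [
    {
        "tweet": tweet.full_text,
        "account": tweet.user.screen_name,
        "id": tweet.id,
        "length": len(tweet.full_text),
        "likes": tweet.favorite_count,
    }
    for tweet in fake_tweets
]

print(rows[0])
# A list of dictionaries like this can be passed directly to pd.DataFrame(rows)
```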


### Cleaning and translation

Once we have a dataset with all the tweets of a user, we need to clean and prepare them to serve as input to the model. For each tweet we remove special whitespace characters and translate the text into English when it is written in another language.

In [156]:
# Characters to remove
spec_chars = ['\n', '\t', '\r']

In [157]:
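# Replace defined characters with a whitespace. Series.str.replace works on
# substrings, whereas Series.replace only matches whole cell values.
for char in spec_chars:
    df['tweet'] = df['tweet'].str.replace(char, ' ', regex=False).str.strip()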

Because we replaced the special characters with a whitespace, some tweets may now contain consecutive whitespaces. Let's collapse them by splitting each tweet on whitespace and re-joining the words with a single space.

In [158]:
# Split and re-join each tweet
df['tweet'] = df['tweet'].str.split().str.join(" ")
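
The same split-and-join idea can be sketched on a single plain string. The function name `normalize_whitespace` and the sample text are made up for illustration:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, newlines, tabs) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empties
    return " ".join(text.split())


raw = "first line\nsecond line \n\tthird  line "
clean = normalize_whitespace(raw)
print(repr(clean))
```

Since `split()` without arguments already treats `\n`, `\t` and `\r` as separators, this step alone also removes those characters from the text.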

In [159]:
df.head()

| | tweet | account | id | source | date | length | likes | RTs | hashtags |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Cuando estaba en el hospital muy mal con el co... | DFD_74 | 1360964958834614273 | Twitter for Android | 2021-02-14 14:51:46 | 157 | 18 | 2 | [] |
| 1 | En el parlamento catalán están los golpistas d... | DFD_74 | 1360960796679954437 | Twitter for Android | 2021-02-14 14:35:14 | 215 | 1 | 0 | [] |
| 2 | @Chaskanaui1 Vamos mejorando. Despacio pero firme | DFD_74 | 1360960263181238278 | Twitter for Android | 2021-02-14 14:33:06 | 49 | 1 | 0 | [] |
| 3 | @pigdemont_ @pardodevera Es una doña nadie. Va... | DFD_74 | 1360936343824723968 | Twitter for Android | 2021-02-14 12:58:03 | 75 | 3 | 0 | [] |
| 4 | Hay gente asquerosa Hay gente miserable Y lueg... | DFD_74 | 1360935288235847680 | Twitter for Android | 2021-02-14 12:53:52 | 74 | 59 | 19 | [] |


Finally we translate them into English.

In [160]:
texto = "Esta frase está en español."
traducido = TextBlob(texto).translate(to="en")
traducido

TextBlob("This phrase is in Spanish.")

In [161]:
df['language'] = df['tweet'].apply(lambda tweet: TextBlob(tweet).detect_language())

In [162]:
df

| | tweet | account | id | source | date | length | likes | RTs | hashtags | language |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cuando estaba en el hospital muy mal con el co... | DFD_74 | 1360964958834614273 | Twitter for Android | 2021-02-14 14:51:46 | 157 | 18 | 2 | [] | es |
| 1 | En el parlamento catalán están los golpistas d... | DFD_74 | 1360960796679954437 | Twitter for Android | 2021-02-14 14:35:14 | 215 | 1 | 0 | [] | es |
| 2 | @Chaskanaui1 Vamos mejorando. Despacio pero firme | DFD_74 | 1360960263181238278 | Twitter for Android | 2021-02-14 14:33:06 | 49 | 1 | 0 | [] | es |
| 3 | @pigdemont_ @pardodevera Es una doña nadie. Va... | DFD_74 | 1360936343824723968 | Twitter for Android | 2021-02-14 12:58:03 | 75 | 3 | 0 | [] | es |
| 4 | Hay gente asquerosa Hay gente miserable Y lueg... | DFD_74 | 1360935288235847680 | Twitter for Android | 2021-02-14 12:53:52 | 74 | 59 | 19 | [] | es |
| 5 | @ManuHdez14 Envidia sana Ya podré ir en otras ... | DFD_74 | 1360934107837005828 | Twitter for Android | 2021-02-14 12:49:10 | 63 | 0 | 0 | [] | es |
| 6 | @ldpsincomplejos De toda esta lista solo me in... | DFD_74 | 1360932661603270661 | Twitter for Android | 2021-02-14 12:43:26 | 101 | 5 | 1 | [] | es |
| 7 | Mucho ánimo a los afiliados que habéis ido a C... | DFD_74 | 1360930687822540802 | Twitter for Android | 2021-02-14 12:35:35 | 163 | 53 | 16 | [] | es |
| 8 | Vamos catalanes !! Hay que ir a votar Ánimo | DFD_74 | 1360852311808090112 | Twitter for Android | 2021-02-14 07:24:09 | 45 | 27 | 6 | [] | es |
| 9 | @VidalQuadras Deben hacer el esfuerzo si. Y vo... | DFD_74 | 1360845887870746625 | Twitter for Android | 2021-02-14 06:58:37 | 55 | 8 | 1 | [] | es |


In [169]:
prueba = ['frase 1', 'frase 2']

prueba_en = []

for i in prueba:
    i_en = TextBlob(i).translate(from_lang='es', to='en')
    prueba_en.append(i_en)

prueba_en

[TextBlob("sentence 1"), TextBlob("sentence 2")]

We still need to work out how to extract the plain text from this list...

In [138]:
tweets = np.array(df['tweet'])

In [151]:
tweets_en = list(map(lambda tweet: TextBlob(tweet).translate(from_lang='es', to='en'), tweets))

In [152]:
tweets_en

[TextBlob("When I was in the hospital very badly with the covid I only thought about spending this day with my wife and my children. Today has been fulfilled Thank you life for continuing there 💪"),
 TextBlob("In the Catalan parliament there are the ERC coup plotters, the fugitives and accused of corruption of the PdCat, the PSC garbage and the assholes of Podemos or the CUP, but you will see how tonight the problem will be Vox 🤣🤣🤣"),
 TextBlob("@ Chaskanaui1 We are improving. Slow but steady"),
 TextBlob("@pigdemont_ @pardodevera She is a nobody. Come on, a zero to the left 🤣"),
 TextBlob("There are disgusting people There are miserable people And then there is Jorge Javier Vázquez"),
 TextBlob("@ ManuHdez14 Healthy envy I'll be able to go on other occasions. Cheer up"),
 TextBlob("@ldpsincomplejos Of all this list I am only interested and read Don Luis. The rest. Zero To continue like this 💪"),
 TextBlob("Much encouragement to the members who have gone to Catalonia as proxies. I envy

In [153]:
df['tweetssss'] = tweets_en

In [154]:
df

| | tweet | account | id | source | date | length | likes | RTs | hashtags | language | tweet_en | tweetssss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cuando estaba en el hospital muy mal con el co... | DFD_74 | 1360964958834614273 | Twitter for Android | 2021-02-14 14:51:46 | 157 | 18 | 2 | [] | es | (W, h, e, n, , I, , w, a, s, , i, n, , t, ... | (W, h, e, n, , I, , w, a, s, , i, n, , t, ... |
| 1 | En el parlamento catalán están los golpistas d... | DFD_74 | 1360960796679954437 | Twitter for Android | 2021-02-14 14:35:14 | 215 | 1 | 0 | [] | es | (I, n, , t, h, e, , C, a, t, a, l, a, n, , ... | (I, n, , t, h, e, , C, a, t, a, l, a, n, , ... |
| 2 | @Chaskanaui1 Vamos mejorando. Despacio pero firme | DFD_74 | 1360960263181238278 | Twitter for Android | 2021-02-14 14:33:06 | 49 | 1 | 0 | [] | es | (@, , C, h, a, s, k, a, n, a, u, i, 1, , W, ... | (@, , C, h, a, s, k, a, n, a, u, i, 1, , W, ... |
| 3 | @pigdemont_ @pardodevera Es una doña nadie. Va... | DFD_74 | 1360936343824723968 | Twitter for Android | 2021-02-14 12:58:03 | 75 | 2 | 0 | [] | es | (@, p, i, g, d, e, m, o, n, t, _, , @, p, a, ... | (@, p, i, g, d, e, m, o, n, t, _, , @, p, a, ... |
| 4 | Hay gente asquerosa Hay gente miserable Y lueg... | DFD_74 | 1360935288235847680 | Twitter for Android | 2021-02-14 12:53:52 | 74 | 58 | 19 | [] | es | (T, h, e, r, e, , a, r, e, , d, i, s, g, u, ... | (T, h, e, r, e, , a, r, e, , d, i, s, g, u, ... |
| 5 | @ManuHdez14 Envidia sana Ya podré ir en otras ... | DFD_74 | 1360934107837005828 | Twitter for Android | 2021-02-14 12:49:10 | 63 | 0 | 0 | [] | es | (@, , M, a, n, u, H, d, e, z, 1, 4, , H, e, ... | (@, , M, a, n, u, H, d, e, z, 1, 4, , H, e, ... |
| 6 | @ldpsincomplejos De toda esta lista solo me in... | DFD_74 | 1360932661603270661 | Twitter for Android | 2021-02-14 12:43:26 | 101 | 5 | 1 | [] | es | (@, l, d, p, s, i, n, c, o, m, p, l, e, j, o, ... | (@, l, d, p, s, i, n, c, o, m, p, l, e, j, o, ... |
| 7 | Mucho ánimo a los afiliados que habéis ido a C... | DFD_74 | 1360930687822540802 | Twitter for Android | 2021-02-14 12:35:35 | 163 | 53 | 16 | [] | es | (M, u, c, h, , e, n, c, o, u, r, a, g, e, m, ... | (M, u, c, h, , e, n, c, o, u, r, a, g, e, m, ... |
| 8 | Vamos catalanes !! Hay que ir a votar Ánimo | DFD_74 | 1360852311808090112 | Twitter for Android | 2021-02-14 07:24:09 | 45 | 27 | 6 | [] | es | (L, e, t, ', s, , g, o, , C, a, t, a, l, a, ... | (L, e, t, ', s, , g, o, , C, a, t, a, l, a, ... |
| 9 | @VidalQuadras Deben hacer el esfuerzo si. Y vo... | DFD_74 | 1360845887870746625 | Twitter for Android | 2021-02-14 06:58:37 | 55 | 8 | 1 | [] | es | (@, V, i, d, a, l, Q, u, a, d, r, a, s, , T, ... | (@, V, i, d, a, l, Q, u, a, d, r, a, s, , T, ... |
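
The `tweet_en` and `tweetssss` columns above render as tuples of characters. A plausible explanation is that the cells hold `TextBlob` objects, which behave like sequences of characters, instead of plain strings. The sketch below uses a hypothetical `BlobLike` class as a stand-in (it is not part of TextBlob) to show the idea of converting each element with `str()` before storing it:

```python
class BlobLike:
    """Hypothetical stand-in for a text wrapper that is iterable like a string."""

    def __init__(self, text):
        self.raw = text

    def __iter__(self):
        return iter(self.raw)

    def __len__(self):
        return len(self.raw)

    def __str__(self):
        return self.raw


translated = [BlobLike("sentence 1"), BlobLike("sentence 2")]

# Iterating over the wrapper yields single characters...
first_chars = tuple(translated[0])[:4]
print(first_chars)

# ...while str() recovers the plain text of each element
plain_text = [str(blob) for blob in translated]
print(plain_text)
```

With real `TextBlob` objects the equivalent conversion would be `[str(blob) for blob in tweets_en]`, and the resulting list of strings can then be assigned to a DataFrame column.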


## Hate speech level prediction

The `Detoxify` library provides three pre-trained models, each with its own set of output labels:

- original: toxic, severe_toxic, obscene, threat, insult, identity_hate.
- unbiased: toxicity, severe_toxicity, obscene, threat, insult, identity_attack, sexual_explicit.
- multilingual: toxicity

While the *original* and *unbiased* models produce several scores (toxicity, obscene, threat, insult, etc.), the *multilingual* model only produces a toxicity score. On the other hand, the *multilingual* model is the only one that can score text written in a language other than English.

Since we want to analyze any kind of account regardless of its language, and to make the solution more robust and scalable, we are going to choose the *multilingual* model.

In [109]:
# Hate speech level prediction
# ------------------------------------------------------------------------------
# Returns a dictionary with toxicity values of each tweet. The key of the
# dictionary is called 'toxicity'.
results = Detoxify('multilingual').predict(list(df['tweet']))
results

{'toxicity': [0.0007464243681170046,
  0.0004143674741499126,
  0.22248923778533936,
  0.00045432784827426076,
  0.00042401894461363554,
  0.0004101789672859013,
  0.003279693890362978,
  0.001105726696550846,
  0.000854434329085052,
  0.02295275218784809,
  0.0004325560003053397,
  0.5668947696685791,
  0.0012916838750243187,
  0.0004201886767987162,
  0.0004094214818906039,
  0.00041339985909871757,
  0.0005079717375338078,
  0.00040876006823964417,
  0.0004219318798277527,
  0.00041506424895487726]}

In [110]:
# Export the new info to the previous DataFrame
df['toxicity'] = results['toxicity']
df.head()

| | tweet | account | id | source | date | length | likes | RTs | hashtags | toxicity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Accurate. https://t.co/UIbkQxyQoF | lexfridman | 1348804908422754310 | Twitter Web App | 2021-01-12 01:32:04 | 33 | 10148 | 751 | [] | 0.000746 |
| 1 | Here's my 2nd conversation with Dmitry Korkin ... | lexfridman | 1348585639630041092 | Twitter Web App | 2021-01-11 11:00:46 | 235 | 342 | 25 | [] | 0.000414 |
| 2 | When you turn your back on the voices of those... | lexfridman | 1348306758452867075 | Twitter Web App | 2021-01-10 16:32:36 | 128 | 7296 | 792 | [] | 0.222489 |
| 3 | Life rarely happens exactly as planned. The fu... | lexfridman | 1348297309436727296 | Twitter Web App | 2021-01-10 15:55:03 | 65 | 5545 | 540 | [] | 0.000454 |
| 4 | The math checks out. https://t.co/oepLfR8zok | lexfridman | 1348110012904845321 | Twitter Web App | 2021-01-10 03:30:48 | 44 | 14355 | 1228 | [] | 0.000424 |
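
One simple way to use these scores is to flag the tweets that exceed a cut-off. The sketch below takes four of the scores from the `results` output above (the first three and the highest one). The 0.5 threshold is an arbitrary assumption for illustration, not a value recommended by Detoxify:

```python
# First three scores and the highest one from the output above
scores = [0.0007464243681170046, 0.0004143674741499126,
          0.22248923778533936, 0.5668947696685791]

THRESHOLD = 0.5  # arbitrary cut-off chosen for illustration only

# Positions (within this short list) of the tweets scoring above the cut-off
flagged = [index for index, score in enumerate(scores) if score > THRESHOLD]
share_flagged = len(flagged) / len(scores)

print(flagged, share_flagged)
```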


## X. References

- http://docs.tweepy.org/en/latest/
- https://github.com/unitaryai/detoxify

## X. Material to review

- Data cleaning with regex: http://rios.tecnm.mx/cdistribuido/recursos/MinDatScr/MineriaScribble.html
- Target website: https://www.ninjalitics.com/


- Translation --> TextBlob
- Sentiment Analysis --> TextBlob
- Universal Dependencies --> GSD

Setting up a translator with TextBlob:

In [None]:
# from Chinese to English and Spanish
oracion_zh = "中国探月工程 亦稱嫦娥工程，是中国启动的第一个探月工程，于2003年3月1日正式启动"
t_zh = TextBlob(oracion_zh)
print(t_zh.translate(from_lang="zh-CN", to="en"))
print(t_zh.translate(from_lang="zh-CN", to="es"))

oracion_ru = "В 1943 году была отправлена в США, где выступала в защиту британской «белой книги», после чего работала в Канаде и Индии."
t_ru = TextBlob(oracion_ru)
print(t_ru.translate(from_lang="ru", to="en"))
print(t_ru.translate(from_lang="ru", to="es"))

print("--------------")

print(
    TextBlob(
        """المحتوى هنا ينقصه الاستشهاد بمصادر. يرجى إيراد مصادر موثوق بها
. أي معلومات غير موثقة يمكن التشكيك بها وإزالتها. (ديسمبر 2018)"""
    ).translate(to="es")
)

TextBlob also offers sentiment analysis:

In [None]:
# sentiment analysis
opinion1 = TextBlob("This new restaurant is great. I had so much fun!!")
print(opinion1.sentiment)

opinion2 = TextBlob("Google News to close in Spain.")
print(opinion2.sentiment)

# subjectivity ranges from 0 to 1
# polarity ranges from -1 to 1


In [None]:
from textblob.sentiments import NaiveBayesAnalyzer

sentences = [
    "I enjoy sports.",
    "Pizza and pasta are my favorite food.",
    "The hotel was horrible.",
    "The movie was awesome. Wonderful play!",
    "Worst actors I've seen in my whole life"
]

for sentence in sentences:
    t = TextBlob(sentence, analyzer=NaiveBayesAnalyzer())
    print(f"{sentence}\nsentiment: {t.sentiment}\n")

And spelling correction:

In [None]:
# spelling correction
b1 = TextBlob("I havv goood speling!")
print(b1.correct())

b2 = TextBlob("Miy naem iz Jonh!")
print(b2.correct())

b3 = TextBlob("Boyz dont cri")
print(b3.correct())

b4 = TextBlob("psicological posesion achifmen comitment")
print(b4.correct())