# @robotodio - tweet extractor

---

## Libraries

In [1]:
# Python libraries
# ------------------------------------------------------------------------------
# Reading files with different formats
import json

# Data wrangling
import re
import pandas as pd
import numpy as np

# Data Visualitation
import seaborn as sns
import matplotlib.pyplot as plt

# Twitter API
import tweepy

# NLP
from textblob import TextBlob
from googletrans import Translator

# Hate speech detection
from detoxify import Detoxify

## 1. Download data from twitter

The tweepy API class provides access to the entire Twitter RESTful API methods. Each method can accept various parameters and return responses to the different requests.

### 1.1. API set up

In order for you to get the Twitter feed working you need four keys: 

- Consumer Key
- Consumer Secret
- Access Token
- Access Token Secret

If you have a developer Twitter account you can generate your own tokens and log in with them. In case you don't have a developer account, please contact me and I will provide mine.

In [2]:
# API Twitter credentials
# ------------------------------------------------------------------------------
# Open .json file containing credentials/tokens as a dictionary
with open("../twitter_api_keys.json") as file:
    api_credentials = json.load(file)
    
# Assign each value of the dictionary to a new variable
consumer_key = api_credentials['consumer_key']
consumer_secret = api_credentials['consumer_secret']
access_token = api_credentials['access_token']
access_token_secret = api_credentials['access_token_secret']

In [3]:
# API set up
# ------------------------------------------------------------------------------
# Create a handler instance with key and secret consumer, and pass the tokens
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
    
# Construct the API instance
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Check credentials
if(api.verify_credentials):
    print("Logged In Successfully.")
else:
    print("Error -- Could not log in with your credentials.")

Logged In Successfully.


### 1.2. Extract tweets and associated metadata

Once the `tweepy.API` class has been instantiated we can call its methods to scrap desired information in Twitter.

Pagination is used a lot in Twitter API development. Iterating through timelines, user lists, direct messages, etc. In order to perform pagination, we must supply a page/cursor parameter with each of our requests. The problem here is this requires a lot of boiler plate code just to manage the pagination loop. To help make pagination easier and require less code, Tweepy has the Cursor object.

In first place, we are going to define a function that returns an iterator of tweets. We will call this fuction each time we want to download tweets from an user account.

In [4]:
# Tweets list iterator
# ------------------------------------------------------------------------------
def tweets_iterator(target, n_items):
    '''
    Returns an iterator of tweets.

        Parameters:
            target (str): The user name of the Twitter account.
            n_items (int): Number of tweets downloaded.

        Returns:
            tweets (ItemIterator): an iterator of tweets.
    '''
    # Instantiate the iterator
    tweets = tweepy.Cursor(
        api.user_timeline,
        screen_name=target,
        include_rts=False,
        exclude_replies=False,
        tweet_mode='extended').items(n_items)
    
    # Returns iterator
    return tweets

In the next cell we are going to download $n$ tweets from a Twitter account, and examinate the content of the first tweet with all the metadata associated.

In [5]:
# Introduce the target Twitter account and number of items to download
target = 'DFD_74'
n_items = 50

# Tweets list (iterator)
tweets = tweets_iterator(target, n_items=1)

# Print the first tweet with .json format
print(json.dumps(next(tweets)._json, indent=4))

{
    "created_at": "Wed Mar 03 15:35:12 +0000 2021",
    "id": 1367136484449484802,
    "id_str": "1367136484449484802",
    "full_text": "Se condena a Felipe VI por lo que hizo su padre\nPues condenemos a Iglesias que su padre formaba parte de una banda terrorista...\nO solo vale para un lado?",
    "truncated": false,
    "display_text_range": [
        0,
        154
    ],
    "entities": {
        "hashtags": [],
        "symbols": [],
        "user_mentions": [],
        "urls": []
    },
    "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 908629539609341952,
        "id_str": "908629539609341952",
        "name": "Espa\u00f1ol74 (david) \ud83c\uddea\ud83c\uddf8",
        "screen_name": "DFD_74",
        "location": "",


As we can see in the previous output cell, each tweet contains a lot of metadata. All this information comes in .json format, and we can access to it with Tweepy methods, or with dictionary methods.

In the next cell we are going to extract some metadata from each tweet, and export to a Pandas DataFrame format.

In [6]:
# Tweet extractor
# ------------------------------------------------------------------------------
# Tweets list (iterator)
tweets = tweets_iterator(target, n_items)

# Read through the iterator, and export the info to a Pandas DataFrame
all_columns = [np.array([
    tweet.full_text,
    tweet.user.screen_name,
    tweet.id,
    tweet.user.followers_count,
    tweet.source,
    tweet.created_at,
    len(tweet.full_text),
    tweet.favorite_count,
    tweet.retweet_count,
    str(tweet.entities['hashtags'])
]) for tweet in tweets]

# Export the list of tweets to a dataframe
df = pd.DataFrame(
    data=all_columns,
    columns=['tweet', 'account', 'id', 'followers', 'source', 'date', 'length',
              'likes', 'RTs', 'hashtags'])

df.head()

Unnamed: 0,tweet,account,id,followers,source,date,length,likes,RTs,hashtags
0,Se condena a Felipe VI por lo que hizo su padr...,DFD_74,1367136484449484802,5790,Twitter for Android,2021-03-03 15:35:12,154,38,18,[]
1,@LZMPodcast Es muy vago. No triunfará,DFD_74,1367131640607162373,5790,Twitter for Android,2021-03-03 15:15:57,37,1,0,[]
2,Alguna vez habéis visto que un medio vaya busc...,DFD_74,1367131300826644482,5790,Twitter for Android,2021-03-03 15:14:36,169,11,5,[]
3,"@ElenaBerberana @perezlealypunto Le corrijo, n...",DFD_74,1367070474920992773,5790,Twitter for Android,2021-03-03 11:12:54,81,1,0,[]
4,Ni caso a Sánchez \nNo salgáis el 8M con esas ...,DFD_74,1367069720172707841,5790,Twitter for Android,2021-03-03 11:09:54,108,6,1,[]


In addition, we are going to create another dataframe with the main data of the target account:

In [7]:
# Account data extractor
# ------------------------------------------------------------------------------
# Tweets list (iterator)
tweets = tweets_iterator(target, n_items=1)

# Read through the iterator, and export the info to a Pandas DataFrame
data_account = [np.array([
    tweet.user.screen_name,
    tweet.user.name,
    tweet.user.description,
    tweet.user.created_at,
    tweet.user.friends_count,
    tweet.user.followers_count,
    tweet.user.statuses_count
]) for tweet in tweets]

# Export the list of features to a dataframe
df_account = pd.DataFrame(
    data=data_account,
    columns=['account', 'account_name', 'bio_description', 'creation_date', 'friends',
             'followers', 'tweets']
)

df_account = pd.melt(df_account)
df_account

Unnamed: 0,variable,value
0,account,DFD_74
1,account_name,Español74 (david) 🇪🇸
2,bio_description,"Padre orgulloso, marido enamorado, currante, e..."
3,creation_date,2017-09-15 09:52:18
4,friends,588
5,followers,5790
6,tweets,61305


### 1.3. Clean and translation

Once we have a dataset with all the tweets of an user, we need to prepare and clean them to serve as input to the model. For each tweet we are going to eliminate special string characters, and we will translate them into English in case they are not.

The traduction is needed for two mains reasons:

- Make a more robust and extensible app
- Detoxify model provides more information when the input is in english language

**Cleaning:**

In [8]:
# Data cleaning
# ------------------------------------------------------------------------------
# Characters to remove
spec_chars = ['\n', '\t', '\r']

# Replace defined characters with a whitespace
for char in spec_chars:
    df['tweet'] = df['tweet'].str.strip().replace(char, ' ')

Because we replaced the special characters with a whitespace, we might end up with double whitespaces in some values. Let’s remove them by splitting each tweet using whitespaces and re-joining the words again using join.

In [9]:
# Split and re-join each tweet
df['tweet'] = df['tweet'].str.split().str.join(" ")

In [10]:
df.head()

Unnamed: 0,tweet,account,id,followers,source,date,length,likes,RTs,hashtags
0,Se condena a Felipe VI por lo que hizo su padr...,DFD_74,1367136484449484802,5790,Twitter for Android,2021-03-03 15:35:12,154,38,18,[]
1,@LZMPodcast Es muy vago. No triunfará,DFD_74,1367131640607162373,5790,Twitter for Android,2021-03-03 15:15:57,37,1,0,[]
2,Alguna vez habéis visto que un medio vaya busc...,DFD_74,1367131300826644482,5790,Twitter for Android,2021-03-03 15:14:36,169,11,5,[]
3,"@ElenaBerberana @perezlealypunto Le corrijo, n...",DFD_74,1367070474920992773,5790,Twitter for Android,2021-03-03 11:12:54,81,1,0,[]
4,Ni caso a Sánchez No salgáis el 8M con esas gu...,DFD_74,1367069720172707841,5790,Twitter for Android,2021-03-03 11:09:54,108,6,1,[]


In [16]:
tweets_cadena = list(df['tweet'])
tweets_cadena

['Se condena a Felipe VI por lo que hizo su padre Pues condenemos a Iglesias que su padre formaba parte de una banda terrorista... O solo vale para un lado?',
 '@LZMPodcast Es muy vago. No triunfará',
 'Alguna vez habéis visto que un medio vaya buscando votantes de podemos, o Bildu comunistas de extrema izquierda proetarras? Pues aquí se buscan a los de vox Inútiles',
 '@ElenaBerberana @perezlealypunto Le corrijo, no son antifascistas. Eso no cuadra.',
 'Ni caso a Sánchez No salgáis el 8M con esas guarras sectarias y cochinotas. Esto no ha acabado Precaución',
 '@TvPlataforma @ivanedlm @rtve @CasaReal No se podría querellar contra este nido de sectarios e inútiles?',
 '@MariaJamardoC El problema. Que está en el partido equivocado',
 'Señora @Macarena_Olona esto no es para querellarse contra la tve? Gracias https://t.co/A6WoPRZ3cG',
 '@federico_videos @rtve Hay que cerrar ese nido de víboras y sectarios No quiero dar un euro más a esos inútiles',
 'Asco da tu tuit Jordi Otra decepción h

**Translation:**

Finally we translate them into English. For carrying out this task, we are going to use the TextBlob library, which is a module for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

In the next cell we can see an example of translation:

In [None]:
'''
# Translation example
text = "Esta frase está en español y queremos traducirla al inglés."
translated_text = TextBlob(text).translate(to="en").string
translated_text
'''

In [None]:
'''
# Main language detection
# ------------------------------------------------------------------------------
# Detect the language pf each tweet
df['org_language'] = df['tweet'].apply(
    lambda tweet: TextBlob(tweet).detect_language())

# Summary of the different languages
freq_languages = df[['id', 'org_language']].groupby(["org_language"])\
                                           .count()\
                                           .sort_values(["id"],  ascending=False)\
                                           .reset_index()

print(freq_languages)
print('-'*80)

# Most common language
main_language = freq_languages.values[0][0]
print(f"The most common language in @{target} account is {main_language}.")
'''

Due to limitations with the TextBlob library and to avoiding be banned, we are going to use another library called GoogleTrans that performs very similar to TextBlob.

In [13]:
translator = Translator(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36')

translator.translate("Esta frase está en español y queremos traducirla al inglés.", src='es', dest='en').text

'Esta frase está en español y queremos traducirla al inglés.'

In [79]:
# Translation
# ------------------------------------------------------------------------------
# Instance a np.array variable that contains all tweets
tweets = np.array(df['tweet'])

# We apply the GoogleTrans translation to each tweet of the previous np.array
tweets_en = list(map(lambda tweet: translator.translate(tweet, src='en', dest='en').text, tweets))

# Create another pandas column with the transalation values
df['tweet_en'] = tweets_en

In [80]:
df.head()

Unnamed: 0,tweet,account,id,followers,source,date,length,likes,RTs,hashtags,tweet_en
0,Se condena a Felipe VI por lo que hizo su padr...,DFD_74,1367136484449484802,5790,Twitter for Android,2021-03-03 15:35:12,154,37,18,[],Se condena a Felipe VI por lo que hizo su padr...
1,@LZMPodcast Es muy vago. No triunfará,DFD_74,1367131640607162373,5790,Twitter for Android,2021-03-03 15:15:57,37,1,0,[],@LZMPodcast Es muy vago. No triunfará
2,Alguna vez habéis visto que un medio vaya busc...,DFD_74,1367131300826644482,5790,Twitter for Android,2021-03-03 15:14:36,169,11,5,[],Alguna vez habéis visto que un medio vaya busc...
3,"@ElenaBerberana @perezlealypunto Le corrijo, n...",DFD_74,1367070474920992773,5790,Twitter for Android,2021-03-03 11:12:54,81,1,0,[],"@ElenaBerberana @perezlealypunto Le corrijo, n..."
4,Ni caso a Sánchez No salgáis el 8M con esas gu...,DFD_74,1367069720172707841,5790,Twitter for Android,2021-03-03 11:09:54,108,6,1,[],Ni caso a Sánchez No salgáis el 8M con esas gu...


## 2. Hate speech level prediction

`Detoxify` library has three pre-trained models:

- original: toxic, severe_toxic, obscene, threat, insult, identity_hate.
- unbiased: toxicity, severe_toxicity, obscene, threat, insult, identity_attack, sexual_explicit.
- multilingual: toxicity

While *original* and *unbiased* models generate different scorings like: toxicity, obscene, threat, insult...etc, *multilingual* model only produces a toxicity scoring. On the other hand, *multilingual* model is the only one that can bring a toxicity score in a language different from english.

Thanks to translation, we can use *original* or *unbiased* model independently from the tweets origin language.

In [67]:
# Hate speech level prediction
# ------------------------------------------------------------------------------
# Returns a dictionary with toxicity values of each tweet. The keys of the
# dictionary are called 'toxicity', 'obscene'...etc
results = Detoxify('original').predict(list(df['tweet_en']))
results

{'toxicity': [0.004285919945687056,
  0.0016925910022109747,
  0.14064688980579376,
  0.016382349655032158,
  0.9529476761817932,
  0.03794165328145027,
  0.003775323973968625,
  0.0009417636902071536,
  0.40279725193977356,
  0.5694505572319031,
  0.8522031307220459,
  0.9654009342193604,
  0.9464920163154602,
  0.04191116243600845,
  0.973484456539154,
  0.17547106742858887,
  0.8875462412834167,
  0.20371657609939575,
  0.009146633557975292,
  0.9354430437088013,
  0.0010710255010053515,
  0.06759965419769287,
  0.0024761538952589035,
  0.9689670205116272,
  0.9640780687332153,
  0.8397986888885498,
  0.9923396110534668,
  0.9584523439407349,
  0.9940893650054932,
  0.9470892548561096,
  0.0009058431023731828,
  0.5406716465950012,
  0.982893168926239,
  0.906766951084137,
  0.9793456792831421,
  0.0008766066748648882,
  0.9787160158157349,
  0.0022771931253373623,
  0.0010056247701868415,
  0.9931908249855042],
 'severe_toxicity': [0.00011028467997675762,
  9.492522076470777e-05,
 

In [68]:
# Add the new info to the previous DataFrame
df['toxicity'] = results['toxicity']
df['obscene'] = results['obscene']
df['threat'] = results['threat']
df['insult'] = results['insult']
df['identity_hate'] = results['identity_hate']

df.head()

Unnamed: 0,tweet,account,id,followers,source,date,length,likes,RTs,hashtags,tweet_en,toxicity,obscene,threat,insult,identity_hate
0,Se condena a Felipe VI por lo que hizo su padr...,DFD_74,1367136484449484802,5790,Twitter for Android,2021-03-03 15:35:12,154,36,17,[],Felipe VI is sentenced for what his father did...,0.004286,0.000241,0.000137,0.000312,0.000355
1,@LZMPodcast Es muy vago. No triunfará,DFD_74,1367131640607162373,5790,Twitter for Android,2021-03-03 15:15:57,37,1,0,[],@LZMPodcast It's very vague. Will not succeed,0.001693,0.000207,0.000116,0.000199,0.000151
2,Alguna vez habéis visto que un medio vaya busc...,DFD_74,1367131300826644482,5790,Twitter for Android,2021-03-03 15:14:36,169,11,5,[],Have you ever seen that a media is looking for...,0.140647,0.001213,0.000403,0.009432,0.004002
3,"@ElenaBerberana @perezlealypunto Le corrijo, n...",DFD_74,1367070474920992773,5790,Twitter for Android,2021-03-03 11:12:54,81,1,0,[],@ElenaBerberana @perezlealypunto I correct you...,0.016382,0.000371,0.00025,0.000704,0.000943
4,Ni caso a Sánchez No salgáis el 8M con esas gu...,DFD_74,1367069720172707841,5790,Twitter for Android,2021-03-03 11:09:54,108,6,1,[],No case of Sánchez. Don't go out on 8M with th...,0.952948,0.755624,0.001876,0.793621,0.048592


Once we have a score for each tweet, we can aggregate these puntuactions by calculating the average.

In [None]:
Definir un umbral de toxicidad!!

In [44]:
sum(df['toxicity'])/len(df)

0.41287379481073005

In [43]:
df['toxicity'].mean()

0.41287379481073005

In [69]:
scoring_average = {'variable': ['avg_toxicity',
                                'avg_obscene', 
                                'avg_threat',
                                'avg_insult',
                                'avg_identity_hate'],
                   
                   'value': [sum(df['toxicity'])/len(df),
                             sum(df['obscene'])/len(df),
                             sum(df['threat'])/len(df),
                             sum(df['insult'])/len(df),
                             sum(df['identity_hate'])/len(df)]}

df_average = pd.DataFrame(scoring_average)
df_average

Unnamed: 0,variable,value
0,avg_toxicity,0.506107
1,avg_obscene,0.337326
2,avg_threat,0.034293
3,avg_insult,0.320647
4,avg_identity_hate,0.032188


In [70]:
df_account = pd.concat([df_account, df_average])
df_account

Unnamed: 0,variable,value
0,account,DFD_74
1,account_name,Español74 (david) 🇪🇸
2,bio_description,"Padre orgulloso, marido enamorado, currante, e..."
3,creation_date,2017-09-15 09:52:18
4,friends,588
5,followers,5790
6,tweets,61305
0,avg_toxicity,0.506107
1,avg_obscene,0.337326
2,avg_threat,0.034293


## 3. Data explotation

1. Toxicidad vs RT + Likes
2. 

In [None]:
Ver si crecen los followers con los mensajes de odio

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(14,8))

sns.scatterplot(ax=ax[0,0], data=df, x='sepal_length', y='sepal_width');
sns.scatterplot(ax=ax[0,1], data=df, x='sepal_length', y='sepal_width', hue='species');
sns.scatterplot(ax=ax[1,0], data=df, x='sepal_length', y='sepal_width', hue='petal_width', size='petal_width');
sns.scatterplot(ax=ax[1,1], data=df, x='sepal_length', y='sepal_width', hue='sepal_length', size='sepal_length');

In [None]:
# Ver para 
sns.catplot(data=penguins, x='bill_length_mm', y='species', hue='sex', kind='strip');

## X. Bibliografía

- http://docs.tweepy.org/en/latest/
- https://github.com/unitaryai/detoxify

## X. Info a revisar

- Limpieza de datos con regex: http://rios.tecnm.mx/cdistribuido/recursos/MinDatScr/MineriaScribble.html
- Web objetivo: https://www.ninjalitics.com/


- Traduction --> TextBlob
- Sentiment Analysis --> TextBlob
- Universal Dependencies --> GSD

Preparar un traductor con TextBlob:

TextBlob también tiene análisis de opinión:

In [None]:
https://blog.usejournal.com/why-and-how-to-make-a-requirements-txt-f329c685181e