# Tweets Featuring

Using some of the data captured in the previous Jupyter notebook, "Read_tweets", let's create some new features.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext watermark
%watermark -v -m -p numpy,pandas -g

import re
from tqdm import tqdm
import yaml
import watermark
import emoji                      # conda install -c conda-forge emoji
import pandas as pd
import numpy as np

CPython 3.7.3
IPython 7.6.1

numpy 1.16.4
pandas 0.24.2

compiler   : MSC v.1900 64 bit (AMD64)
system     : Windows
release    : 7
machine    : AMD64
processor  : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
CPU cores  : 4
interpreter: 64bit
Git hash   : b15a3f632e9233ec78c35a8cbe1da876e727bfc5


### Constants

In [2]:
TWEETS_INPUT = "tweets.csv"
PROCESSED_TWEETS = "tweets-processed.csv"
TWITER_USERS_FILE = 'twitter_users.csv'

### Load Data

In [3]:
tweets_df = pd.read_csv(TWEETS_INPUT, parse_dates=['created_at'])
users_df = pd.read_csv(TWITER_USERS_FILE, parse_dates=['created_at'])

In [4]:
tweets_df.head(3)

Unnamed: 0,screen_name,location,id,source,coordinates,favorite_count,favorited,lang,hashtags,created_at,text
0,ONYXCONtruth,ATL & The Universe,446395993,Instagram,,0,False,en,"[{'text': 'real', 'indices': [18, 23]}, {'text...",2019-07-04 22:57:55,Salute to all the #real #Artist that make #ONY...
1,CassiniFrank,"Vancouver, British Columbia",997922612255703040,Twitter for iPhone,,0,False,en,"[{'text': 'UK', 'indices': [47, 50]}, {'text':...",2019-07-04 22:57:47,RT @volition_movie: Mighty chuffed to have our...
2,afriwomencinema,,29717530,Twitter Web Client,,0,False,en,"[{'text': 'Senegal', 'indices': [107, 115]}]",2019-07-04 22:57:42,FROM THE AFRICAN WOMEN IN CINEMA BLOG ARCHIVES...


In [5]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 11 columns):
screen_name       3000 non-null object
location          2406 non-null object
id                3000 non-null int64
source            3000 non-null object
coordinates       81 non-null object
favorite_count    3000 non-null int64
favorited         3000 non-null bool
lang              3000 non-null object
hashtags          3000 non-null object
created_at        3000 non-null datetime64[ns]
text              3000 non-null object
dtypes: bool(1), datetime64[ns](1), int64(2), object(7)
memory usage: 237.4+ KB


In [6]:
tweets_df.describe()

Unnamed: 0,id,favorite_count
count,3000.0,3000.0
mean,3.2774e+17,0.769
std,4.604394e+17,4.222983
min,4046681.0,0.0
25%,190586900.0,0.0
50%,1906807000.0,0.0
75%,8.439787e+17,0.0
max,1.146907e+18,113.0


From the summary above, `location` has a number of NaN values (2406 of 3000 rows are non-null), which we can replace with the string 'unknown'. <br>
`coordinates`, on the other hand, is almost entirely null (only 81 non-null values).

In [7]:
tweets_df['location'].fillna('unknown', inplace=True)

In [8]:
print('Known coordinates: {}%'.format(100 * tweets_df.coordinates.count() / tweets_df.shape[0]))

Known coordinates: 2.7%


In [9]:
tweets_df.drop(['coordinates'], axis=1, inplace=True)
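
The fill-or-drop decision above can also be driven by the share of missing values per column. The following is a minimal sketch on a made-up toy frame; `toy_df`, `SPARSE_THRESHOLD`, and the 70% cutoff are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for tweets_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "location": ["ATL", None, "London", None],
    "coordinates": [None, None, None, "[1.0, 2.0]"],
    "text": ["a", "b", "c", "d"],
})

# Percentage of missing values per column
missing_pct = toy_df.isna().mean() * 100

# Hypothetical rule: drop columns that are mostly empty, fill the rest
SPARSE_THRESHOLD = 70.0
sparse_cols = missing_pct[missing_pct > SPARSE_THRESHOLD].index.tolist()
cleaned_df = toy_df.drop(columns=sparse_cols).fillna({"location": "unknown"})

print(missing_pct)
print(cleaned_df)
```

Returning a new frame instead of using `inplace=True` keeps the original data available for comparison while experimenting with the threshold.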

## Extra features: tweets

* Extract tags embedded in the tweet:

In [10]:
tweets_df.text[0]

'Salute to all the #real #Artist that make #ONYXCON POSSIBLE ‼️ #popularArts #Film #ComicBooks #Gaming #HipHop… https://t.co/vFtdbfXvMn'

In [11]:
tag_regex = re.compile(r'#[\w]+')
def get_tags(text, regex=tag_regex):
    tags = regex.findall(text)
    return [k.replace('#', '') for k in tags]

In [12]:
tweets_df['tags'] = tweets_df.text.apply(get_tags, args = (tag_regex,))
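
As a possible variation on `get_tags`, a capturing group in the pattern returns the tag names directly, so the `#` does not need to be stripped afterwards. This is only a sketch; `TAG_CAPTURE` and `extract_tags` are hypothetical names, and the sample string is a shortened version of the tweet shown above.

```python
import re

# The parentheses capture only the tag name, without the leading '#'
TAG_CAPTURE = re.compile(r'#(\w+)')

def extract_tags(text):
    """Return the hashtag names found in text, in order of appearance."""
    return TAG_CAPTURE.findall(text)

sample = 'Salute to all the #real #Artist that make #ONYXCON POSSIBLE'
tags = extract_tags(sample)
print(tags)
```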

In [13]:
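def clean_tags(text, tags):
    # Remove each hashtag from the text
    for tag in tags:
        text = text.replace('#' + tag, '')
    # Normalize line breaks and stray punctuation, even when the tweet has no tags
    text = text.replace('\n', ' ').replace('\r', '').replace(' ― ', '').replace(' …', '')
    # Collapse repeated spaces and trim the ends
    return re.sub(' +', ' ', text).strip()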

In [14]:
tweets_df['text'] = tweets_df[['text', 'tags']].apply(lambda x: clean_tags(*x), axis=1)
tweets_df['text'][0]        # Verify updates...

'Salute to all the that make POSSIBLE ‼️ https://t.co/vFtdbfXvMn'
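
One thing to watch with `clean_tags`: it removes each tag with `str.replace`, so a tag that is a prefix of a longer tag also strips part of the longer one. The sketch below shows the effect on a made-up sentence and one possible alternative that removes whole hashtags with a regular expression; `strip_tags` is a hypothetical helper, not the notebook's function.

```python
import re

def strip_tags(text):
    """Remove every whole hashtag, then collapse the leftover whitespace."""
    no_tags = re.sub(r'#\w+', '', text)
    return re.sub(r'\s+', ' ', no_tags).strip()

sample = 'Watch #Film at the #FilmFestival today'   # made-up example

# Plain replace of '#Film' also eats the start of '#FilmFestival'
naive = sample.replace('#Film', '')

stripped = strip_tags(sample)
print(repr(naive))
print(repr(stripped))
```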

* Retweets

In [15]:
retweet_regex = re.compile(r'RT\s@[\w]+:')
def is_retweet(text, regex=retweet_regex):
    retweet = regex.findall(text)
    is_retweet = False
    author = ''
    if retweet:
        text = text.replace(retweet[0], '').strip()
        is_retweet = True
        author = retweet[0].replace('RT @', '').replace(':', '')
    return is_retweet, author, text

In [16]:
# Use a separate name for the flags so the is_retweet function is not overwritten
retweet_flags, authors, text = zip(*tweets_df.text.apply(is_retweet, args = (retweet_regex,)))

In [17]:
tweets_df['is_retweet'], tweets_df['retweet_author'], tweets_df['text'] = [retweet_flags, authors, text]
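
The retweet parsing could also be written with a capturing group, which yields the author without the follow-up `replace` calls. This is a sketch under one assumption that differs from `is_retweet`: the pattern is anchored to the start of the text, so only a leading `RT @user:` counts. `RETWEET_RE` and `split_retweet` are hypothetical names.

```python
import re

# Assumption: the retweet marker appears at the very start of the text
RETWEET_RE = re.compile(r'^RT\s@(\w+):\s*')

def split_retweet(text):
    """Return (is_retweet, author, text_without_marker)."""
    match = RETWEET_RE.match(text)
    if match is None:
        return False, '', text
    return True, match.group(1), text[match.end():]

print(split_retweet('RT @volition_movie: Mighty chuffed'))
print(split_retweet('Hello RT'))
```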

In [18]:
tweets_df.head(3)

Unnamed: 0,screen_name,location,id,source,favorite_count,favorited,lang,hashtags,created_at,text,tags,is_retweet,retweet_author
0,ONYXCONtruth,ATL & The Universe,446395993,Instagram,0,False,en,"[{'text': 'real', 'indices': [18, 23]}, {'text...",2019-07-04 22:57:55,Salute to all the that make POSSIBLE ‼️ https:...,"[real, Artist, ONYXCON, popularArts, Film, Com...",False,
1,CassiniFrank,"Vancouver, British Columbia",997922612255703040,Twitter for iPhone,0,False,en,"[{'text': 'UK', 'indices': [47, 50]}, {'text':...",2019-07-04 22:57:47,Mighty chuffed to have our at ’s legendary @Fr...,"[UK, Premiere, London, WestEnd, Fright]",True,volition_movie
2,afriwomencinema,unknown,29717530,Twitter Web Client,0,False,en,"[{'text': 'Senegal', 'indices': [107, 115]}]",2019-07-04 22:57:42,FROM THE AFRICAN WOMEN IN CINEMA BLOG ARCHIVES...,[Senegal],False,


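* Length of the tweet (note that `len` on a string counts characters, so `n_words` holds the character count rather than the word count)

In [19]:
tweets_df['n_words'] = tweets_df.text.apply(len)    # len(str) returns the number of characters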
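
Since `len` applied to a string returns its number of characters, the values stored in `n_words` are character counts. If an actual word count is wanted, splitting on whitespace first is one simple option. The snippet below is a sketch on a fragment of the first tweet; the variable names are illustrative only.

```python
text = 'Salute to all the that make POSSIBLE'

n_chars = len(text)           # number of characters, spaces included
n_words = len(text.split())   # number of whitespace-separated tokens

print(n_chars, n_words)
```

In the notebook this would correspond to something like `tweets_df.text.str.split().str.len()`, again assuming whitespace-separated tokens are an acceptable definition of a word.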

In [20]:
tweets_df['has_link'] = tweets_df.text.apply(lambda x: 'http' in x)
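
The substring test `'http' in x` is cheap but also matches words that merely contain "http". A slightly stricter check, sketched below, looks for an `http://` or `https://` prefix followed by non-space characters; `URL_RE` and `find_links` are hypothetical names, and the pattern is a deliberately simple assumption rather than a full URL grammar.

```python
import re

# Simple pattern: scheme followed by any run of non-whitespace characters
URL_RE = re.compile(r'https?://\S+')

def find_links(text):
    """Return all substrings that look like http(s) links."""
    return URL_RE.findall(text)

with_link = 'POSSIBLE https://t.co/vFtdbfXvMn'
without_link = 'see the httpd docs'

print(find_links(with_link))
print('http' in without_link, bool(find_links(without_link)))
```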

In [21]:
# Note: get_emoji_regexp() is available in emoji < 2.0; it was removed in later releases
emoji_regex = emoji.get_emoji_regexp()
def capture_emojis(text):
    emojis = emoji_regex.findall(text)
    if emojis:
        emoji_count = len(emojis)
        for e in emojis:
            text = text.replace(e, '')
        text = text.strip()
        emojis = ' '.join(emojis)
    else:
        emoji_count = 0
        emojis = ''
    
    return emoji_count, emojis, text

In [22]:
emoji_count, emojis, text = zip(*tweets_df.text.apply(capture_emojis))
tweets_df['emoji_count'], tweets_df['emojis'], tweets_df['text'] = [emoji_count, emojis, text]

In [23]:
tweets_df.head(3)

Unnamed: 0,screen_name,location,id,source,favorite_count,favorited,lang,hashtags,created_at,text,tags,is_retweet,retweet_author,n_words,has_link,emoji_count,emojis
0,ONYXCONtruth,ATL & The Universe,446395993,Instagram,0,False,en,"[{'text': 'real', 'indices': [18, 23]}, {'text...",2019-07-04 22:57:55,Salute to all the that make POSSIBLE ️ https:/...,"[real, Artist, ONYXCON, popularArts, Film, Com...",False,,63,True,1,‼
1,CassiniFrank,"Vancouver, British Columbia",997922612255703040,Twitter for iPhone,0,False,en,"[{'text': 'UK', 'indices': [47, 50]}, {'text':...",2019-07-04 22:57:47,Mighty chuffed to have our at ’s legendary @Fr...,"[UK, Premiere, London, WestEnd, Fright]",True,volition_movie,81,False,0,
2,afriwomencinema,unknown,29717530,Twitter Web Client,0,False,en,"[{'text': 'Senegal', 'indices': [107, 115]}]",2019-07-04 22:57:42,FROM THE AFRICAN WOMEN IN CINEMA BLOG ARCHIVES...,[Senegal],False,,130,True,0,


In [24]:
tweets_df.to_csv(PROCESSED_TWEETS, index=False)

## Extra features: Users Info

In [25]:
users_df.head(3)

Unnamed: 0,name,screen_name,id,lang,followers_count,location,created_at,statuses_count,friends_count,description
0,mohd akmal,cipanoss,631924707,,3,,2012-07-10 09:43:19,1,53,b''
1,basset bourouro,bassetbourouro,2184098312,,0,,2013-11-09 10:56:00,2,7,b''
2,سعود,SAUD9969,597529653,,168,,2012-06-02 14:16:50,11,524,b'\xd9\x85\xd8\xba\xd8\xb1\xd8\xaf \xd8\xb3\xd...


In [26]:
users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 548 entries, 0 to 547
Data columns (total 10 columns):
name               548 non-null object
screen_name        548 non-null object
id                 548 non-null int64
lang               0 non-null float64
followers_count    548 non-null int64
location           173 non-null object
created_at         548 non-null datetime64[ns]
statuses_count     548 non-null int64
friends_count      548 non-null int64
description        548 non-null object
dtypes: datetime64[ns](1), float64(1), int64(4), object(4)
memory usage: 42.9+ KB
