# 3 Million Russian Troll Tweets
- James M Irving, Ph.D.
- Mod 4 Project
- Flatiron Full Time Data Science Bootcamp - 02/2019 Cohort

## GOAL: 

- *If I can get a control dataset of non-troll tweets from the same time period with similar hashtags:*
    - Use NLP to predict whether a tweet is from an authentic user or a Russian troll.
- *If there are no control tweets to compare to:*
    - Use NLP to predict how many retweets a troll tweet will get.
    - Consider both the raw # of retweets and a normalized # of retweets / # of followers.
        - The latter would give a better indication of the language's effect on propagation.
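
A minimal sketch of the normalized target, assuming a per-tweet retweet count is available. The variables listed under Dataset Features include only a binary `retweet` flag, so the `retweet_count` column here is hypothetical and would have to come from another source. Follower counts of 0 or -1 (both occur in the data) are treated as missing rather than divided by.

```python
import pandas as pd

# Toy data: `retweet_count` is a hypothetical column (not in the Kaggle files).
tweets = pd.DataFrame({
    'followers': [1000, 50, 0, -1],
    'retweet_count': [20, 5, 3, 1],
})

# Treat non-positive follower counts as missing to avoid dividing by zero.
valid_followers = tweets['followers'].where(tweets['followers'] > 0)
tweets['retweets_per_follower'] = tweets['retweet_count'] / valid_followers

print(tweets)
```

Rows with unusable follower counts end up as NaN, so they can be dropped or imputed explicitly before modeling.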
        

## Dataset Features:
- Kaggle Dataset published by FiveThirtyEight
    - https://www.kaggle.com/fivethirtyeight/russian-troll-tweets/downloads/russian-troll-tweets.zip/2
<br>    
- Data is split into 9 .csv files
    - 'IRAhandle_tweets_1.csv' to 9

- **Variables:**
    - ~~`external_author_id` | An author account ID from Twitter~~
    - `author` | The handle sending the tweet
    - `content` | The text of the tweet
    - `region` | A region classification, as [determined by Social Studio](https://help.salesforce.com/articleView?id=000199367&type=1)
    - `language` | The language of the tweet
    - `publish_date` | The date and time the tweet was sent
    - ~~`harvested_date` | The date and time the tweet was collected by Social Studio~~
    - `following` | The number of accounts the handle was following at the time of the tweet
    - `followers` | The number of followers the handle had at the time of the tweet
    - `updates` | The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
    - `post_type` | Indicates if the tweet was a retweet or a quote-tweet (a retweet with the sender's own comment added)
    - `account_type` | Specific account theme, as coded by Linvill and Warren
    - `retweet` | A binary indicator of whether or not the tweet is a retweet [?]
    - `account_category` | General account theme, as coded by Linvill and Warren
    - `new_june_2018` | A binary indicator of whether the handle was newly listed in June 2018
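
The variable list above suggests setting explicit dtypes at load time. The sketch below is one way to do that, run on a two-row in-memory sample rather than the real files. Reading `external_author_id` as a string (so long IDs are not shown in scientific notation) and the choice of categorical columns are assumptions, not part of the original workflow, and the sample values are made up.

```python
import io
import pandas as pd

# Two-row stand-in for one of the IRAhandle_tweets_*.csv files (values are made up).
sample_csv = io.StringIO(
    "external_author_id,author,content,language,publish_date,followers,retweet,account_category\n"
    "906000000000000000,EXAMPLE_A,Some text #news,English,10/1/2017 19:58,9636,0,RightTroll\n"
    "2943515140,EXAMPLE_B,Other text,Russian,3/22/2016 17:35,120,1,NonEnglish\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={
        'external_author_id': str,       # keep IDs as text, not floats
        'language': 'category',
        'account_category': 'category',
    },
)

# Parse the month/day/year timestamps used in the dataset.
sample['publish_date'] = pd.to_datetime(sample['publish_date'], format='%m/%d/%Y %H:%M')

print(sample.dtypes)
```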
    
### **Classification of account_category**
Taken from: [rcmediafreedom.eu summary](https://www.rcmediafreedom.eu/Publications/Academic-sources/Troll-Factories-The-Internet-Research-Agency-and-State-Sponsored-Agenda-Building)

>- **They identified five categories of IRA-associated Twitter accounts, each with unique patterns of behaviors:**
    - **Right Troll**, spreading nativist and right-leaning populist messages. It supported the candidacy and Presidency of Donald Trump and denigrated the Democratic Party. It often sent divisive messages about mainstream and moderate Republicans.
    - **Left Troll**, sending socially liberal messages and discussing gender, sexual, religious, and -especially- racial identity. Many tweets seemed intentionally divisive, attacking mainstream Democratic politicians, particularly Hillary Clinton, while supporting Bernie Sanders prior to the election.
    - **News Feed**, overwhelmingly presenting themselves as U.S. local news aggregators, linking to legitimate regional news sources and tweeting about issues of local interest.
    - **Hashtag Gamer**, dedicated almost exclusively to playing hashtag games.
    - **Fearmonger**: spreading a hoax about poisoned turkeys near the 2015 Thanksgiving holiday.

>The different types of account were used differently and their efforts were conducted systematically, with different allocation when faced with different political circumstances or shifting goals. E.g.: there was a spike of activity by right and left troll accounts before the publication of John Podesta's emails by WikiLeaks. According to the authors, this activity can be characterised as “industrialized political warfare”.

___

In [1]:
import bs_ds as bs
from bs_ds.imports import *

bs_ds v. 0.7.4 ... read the docs at https://bs-ds.readthedocs.io/en/latest/index.html
For convenient loading of standard modules :
>> from bs_ds.imports import *



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\james\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Module/Package Handle
pandas,pd
numpy,np
matplotlib,mpl
matplotlib.pyplot,plt
seaborn,sns


In [2]:
import os
root_dir = 'russian-troll-tweets/'
# Collect the nine CSV paths; sort them because os.listdir() returns files in arbitrary order
filelist = sorted(os.path.join(root_dir, file) for file in os.listdir(root_dir) if file.endswith('.csv'))
filelist

['russian-troll-tweets/IRAhandle_tweets_1.csv',
 'russian-troll-tweets/IRAhandle_tweets_2.csv',
 'russian-troll-tweets/IRAhandle_tweets_3.csv',
 'russian-troll-tweets/IRAhandle_tweets_4.csv',
 'russian-troll-tweets/IRAhandle_tweets_5.csv',
 'russian-troll-tweets/IRAhandle_tweets_6.csv',
 'russian-troll-tweets/IRAhandle_tweets_7.csv',
 'russian-troll-tweets/IRAhandle_tweets_8.csv',
 'russian-troll-tweets/IRAhandle_tweets_9.csv']

In [3]:
# Previewing dataset
df = pd.read_csv(filelist[0])
df.head(3)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,post_type,account_type,new_june_2018,retweet,account_category
0,9.06e+17,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,10/1/2017 19:58,10/1/2017 19:59,1052,9636,253,,Right,0,0,RightTroll
1,9.06e+17,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,10/1/2017 22:43,10/1/2017 22:43,1054,9637,254,,Right,0,0,RightTroll
2,9.06e+17,10_GOP,Daughter of fallen Navy Sailor delivers powerf...,Unknown,English,10/1/2017 22:50,10/1/2017 22:51,1054,9637,255,RETWEET,Right,0,1,RightTroll


## Merging full dataset

In [4]:
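# Vertically concatenate all files in a single call (faster than growing df inside a loop).
# Each file keeps its own row index, so index values repeat across files.
df = pd.concat([pd.read_csv(file) for file in filelist], axis=0)
df.info()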

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2973371 entries, 0 to 37554
Data columns (total 15 columns):
external_author_id    float64
author                object
content               object
region                object
language              object
publish_date          object
harvested_date        object
following             int64
followers             int64
updates               int64
post_type             object
account_type          object
new_june_2018         int64
retweet               int64
account_category      object
dtypes: float64(1), int64(5), object(9)
memory usage: 363.0+ MB


In [5]:
df.head(2)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,post_type,account_type,new_june_2018,retweet,account_category
0,9.06e+17,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,10/1/2017 19:58,10/1/2017 19:59,1052,9636,253,,Right,0,0,RightTroll
1,9.06e+17,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,10/1/2017 22:43,10/1/2017 22:43,1054,9637,254,,Right,0,0,RightTroll


# SCRUB / EDA

In [6]:
from pandas_profiling import ProfileReport
ProfileReport(df)

Dataset info,Value
Number of variables,16
Number of observations,2973371
Total Missing (%),3.5%
Total size in memory,363.0 MiB
Average record size in memory,128.0 B

Variable types,Count
Numeric,5
Categorical,9
Boolean,2
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

Variable: account_category (Categorical)
Distinct count,8
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
NonEnglish,837725
RightTroll,719087
NewsFeed,599294
Other values (5),817265

Value,Count,Frequency (%),Unnamed: 3
NonEnglish,837725,28.2%,
RightTroll,719087,24.2%,
NewsFeed,599294,20.2%,
LeftTroll,427811,14.4%,
HashtagGamer,241827,8.1%,
Commercial,122582,4.1%,
Unknown,13905,0.5%,
Fearmonger,11140,0.4%,

Variable: account_type (Categorical)
Distinct count,21
Unique (%),0.0%
Missing (%),0.0%
Missing (n),363

0,1
Russian,721191
Right,718619
local,460197
Other values (17),1073001

Value,Count,Frequency (%),Unnamed: 3
Russian,721191,24.3%,
Right,718619,24.2%,
local,460197,15.5%,
left,427811,14.4%,
Hashtager,241827,8.1%,
news,139097,4.7%,
Commercial,122582,4.1%,
German,91851,3.1%,
Italian,15899,0.5%,
?,13542,0.5%,

Variable: author (Categorical)
Distinct count,2848
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
EXQUOTE,59652
SCREAMYMONKEY,44041
WORLDNEWSPOLI,36974
Other values (2845),2832704

Value,Count,Frequency (%),Unnamed: 3
EXQUOTE,59652,2.0%,
SCREAMYMONKEY,44041,1.5%,
WORLDNEWSPOLI,36974,1.2%,
AMELIEBALDWIN,35371,1.2%,
TODAYPITTSBURGH,33602,1.1%,
SPECIALAFFAIR,32588,1.1%,
SEATTLE_POST,30800,1.0%,
FINDDIET,29038,1.0%,
KANSASDAILYNEWS,28890,1.0%,
ROOMOFRUMOR,28360,1.0%,

Variable: content (Categorical)
Distinct count,2365943
Unique (%),79.6%
Missing (%),0.0%
Missing (n),1

0,1
В городе Сочи. Олимпиада – праздник или стихийное...,670
Лондон 2012 — Олимпиада Антихриста,227
NewsOne Now Audio Podcast: Bishop E.W. Jackson Calls #BlackLivesMatter Is Movement “Disgraceful”,217
Other values (2365939),2972256

Value,Count,Frequency (%),Unnamed: 3
В городе Сочи. Олимпиада – праздник или стихийное...,670,0.0%,
Лондон 2012 — Олимпиада Антихриста,227,0.0%,
NewsOne Now Audio Podcast: Bishop E.W. Jackson Calls #BlackLivesMatter Is Movement “Disgraceful”,217,0.0%,
"...стадион, У нас своя олимпиада – За малышом бросок под стол...",213,0.0%,
Захарченко: ОБСЕ игнорирует гибель мирных жителей http://t.co/bvDdvi71yc http://t.co/7Wyyi5sYdh,197,0.0%,
"Вот в такую ситуацию можно попасть, если заказать первое попавшееся такси http://t.co/lLiWML7UCi http://t.co/EGzPOn15mO",144,0.0%,
"Олимпиада 2014. Андрей Малахов, Дмитрий Борисов и Кирилл",141,0.0%,
Honor scores #sports,130,0.0%,
TV/radio schedule #sports,119,0.0%,
SE Wis. road construction projects #Wisconsin,114,0.0%,

Variable: external_author_id (Numeric)
Distinct count,2490
Unique (%),0.1%
Missing (%),0.0%
Missing (n),4
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.2961e+17
Minimum,34976000
Maximum,9.8125e+17
Zeros (%),0.0%

0,1
Minimum,34976000.0
5-th percentile,1647500000.0
Q1,1930700000.0
Median,2581800000.0
Q3,3254300000.0
95-th percentile,8.92e+17
Maximum,9.8125e+17
Range,9.8125e+17
Interquartile range,1323500000.0

0,1
Standard deviation,3.0363e+17
Coef of variation,2.3426
Kurtosis,1.8042
Mean,1.2961e+17
MAD,2.19e+17
Skewness,1.9366
Sum,3.8539e+23
Variance,9.2194e+34
Memory size,22.7 MiB

Value,Count,Frequency (%),Unnamed: 3
8.92e+17,65401,2.2%,
3272640600.0,45985,1.5%,
8.98e+17,45108,1.5%,
2943515140.0,44041,1.5%,
7.89e+17,43395,1.5%,
8.95e+17,38644,1.3%,
1679279490.0,35371,1.2%,
2601235821.0,33602,1.1%,
2951556370.0,32588,1.1%,
8.91e+17,31726,1.1%,

Value,Count,Frequency (%),Unnamed: 3
34976398.0,32,0.0%,
72581988.0,40,0.0%,
87588938.0,1993,0.1%,
97335028.0,232,0.0%,
131812518.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
9.44543e+17,1,0.0%,
9.44573e+17,1,0.0%,
9.44658e+17,35,0.0%,
9.68131e+17,1,0.0%,
9.81251e+17,96,0.0%,

Variable: followers (Numeric)
Distinct count,66363
Unique (%),2.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7018.9
Minimum,-1
Maximum,251276
Zeros (%),0.5%

0,1
Minimum,-1
5-th percentile,50
Q1,320
Median,1274
Q3,10600
95-th percentile,26669
Maximum,251276
Range,251277
Interquartile range,10280

0,1
Standard deviation,14585
Coef of variation,2.0779
Kurtosis,58.41
Mean,7018.9
MAD,8538.4
Skewness,5.9359
Sum,20869833400
Variance,212710000
Memory size,22.7 MiB

Value,Count,Frequency (%),Unnamed: 3
0,13973,0.5%,
2,11148,0.4%,
1,9726,0.3%,
4,7525,0.3%,
5,7170,0.2%,
7,6167,0.2%,
3,5917,0.2%,
6,5897,0.2%,
8,5080,0.2%,
83,4555,0.2%,

Value,Count,Frequency (%),Unnamed: 3
-1,3,0.0%,
0,13973,0.5%,
1,9726,0.3%,
2,11148,0.4%,
3,5917,0.2%,

Value,Count,Frequency (%),Unnamed: 3
251265,1,0.0%,
251266,1,0.0%,
251267,1,0.0%,
251275,2,0.0%,
251276,1,0.0%,

Variable: following (Numeric)
Distinct count,28224
Unique (%),0.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3433.5
Minimum,-1
Maximum,76210
Zeros (%),2.0%

0,1
Minimum,-1
5-th percentile,3
Q1,327
Median,1499
Q3,4730
95-th percentile,13039
Maximum,76210
Range,76211
Interquartile range,4403

0,1
Standard deviation,5609.9
Coef of variation,1.6339
Kurtosis,48.907
Mean,3433.5
MAD,3433.1
Skewness,5.2837
Sum,10209139778
Variance,31471000
Memory size,22.7 MiB

Value,Count,Frequency (%),Unnamed: 3
2,61372,2.1%,
0,58730,2.0%,
3,38158,1.3%,
5,21242,0.7%,
68,12119,0.4%,
74,9948,0.3%,
65,7272,0.2%,
7,6816,0.2%,
120,5852,0.2%,
247,5719,0.2%,

Value,Count,Frequency (%),Unnamed: 3
-1,3,0.0%,
0,58730,2.0%,
1,4719,0.2%,
2,61372,2.1%,
3,38158,1.3%,

Value,Count,Frequency (%),Unnamed: 3
76206,13,0.0%,
76207,3,0.0%,
76208,2,0.0%,
76209,4,0.0%,
76210,4,0.0%,

Variable: harvested_date (Categorical)
Distinct count,906316
Unique (%),30.5%
Missing (%),0.0%
Missing (n),0

0,1
3/22/2016 17:35,1333
12/29/2016 4:01,455
3/22/2016 17:34,232
Other values (906313),2971351

Value,Count,Frequency (%),Unnamed: 3
3/22/2016 17:35,1333,0.0%,
12/29/2016 4:01,455,0.0%,
3/22/2016 17:34,232,0.0%,
6/20/2018 4:03,224,0.0%,
8/16/2017 1:32,216,0.0%,
8/16/2017 1:29,200,0.0%,
10/30/2016 14:57,178,0.0%,
8/15/2017 17:11,149,0.0%,
12/29/2016 4:30,144,0.0%,
8/16/2017 1:30,144,0.0%,

Variable: index (Numeric)
Distinct count,397232
Unique (%),13.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,182210
Minimum,0
Maximum,397231
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,16518
Q1,88223
Median,181140
Q3,274060
95-th percentile,355140
Maximum,397231
Range,397231
Interquartile range,185840

0,1
Standard deviation,108200
Coef of variation,0.59384
Kurtosis,-1.1622
Mean,182210
MAD,93452
Skewness,0.05064
Sum,541765416541
Variance,11708000000
Memory size,22.7 MiB

Value,Count,Frequency (%),Unnamed: 3
9,2047,0.1%,
9,7377,0.2%,
9,27871,0.9%,
9,25822,0.9%,
9,31965,1.1%,
9,29916,1.0%,
9,19675,0.7%,
9,17626,0.6%,
9,23769,0.8%,
9,21720,0.7%,

Value,Count,Frequency (%),Unnamed: 3
1,390466,13.1%,
1,393299,13.2%,
1,394660,13.3%,
1,396709,13.3%,
1,391016,13.2%,

Value,Count,Frequency (%),Unnamed: 3
9,30841,1.0%,
9,24698,0.8%,
9,26747,0.9%,
9,12404,0.4%,
9,2047,0.1%,

Variable: language (Categorical)
Distinct count,56
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
English,2128963
Russian,624124
German,87171
Other values (53),133113

Value,Count,Frequency (%),Unnamed: 3
English,2128963,71.6%,
Russian,624124,21.0%,
German,87171,2.9%,
Ukrainian,39361,1.3%,
Italian,18254,0.6%,
Serbian,9615,0.3%,
Uzbek,9491,0.3%,
Bulgarian,9458,0.3%,
LANGUAGE UNDEFINED,8325,0.3%,
Arabic,7595,0.3%,

Variable: new_june_2018 (Boolean)
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.20787

0,1
0,2355286
1,618085

Value,Count,Frequency (%),Unnamed: 3
0,2355286,79.2%,
1,618085,20.8%,

Variable: post_type (Categorical)
Distinct count,3
Unique (%),0.0%
Missing (%),55.9%
Missing (n),1662425

0,1
RETWEET,1270702
QUOTE_TWEET,40244
(Missing),1662425

Value,Count,Frequency (%),Unnamed: 3
RETWEET,1270702,42.7%,
QUOTE_TWEET,40244,1.4%,
(Missing),1662425,55.9%,

Variable: publish_date (Categorical)
Distinct count,896684
Unique (%),30.2%
Missing (%),0.0%
Missing (n),0

0,1
8/16/2017 1:29,202
8/16/2017 1:31,186
8/15/2017 17:01,149
Other values (896681),2972834

Value,Count,Frequency (%),Unnamed: 3
8/16/2017 1:29,202,0.0%,
8/16/2017 1:31,186,0.0%,
8/15/2017 17:01,149,0.0%,
8/16/2017 1:30,146,0.0%,
8/16/2017 1:32,144,0.0%,
8/12/2017 19:11,144,0.0%,
8/15/2017 17:09,143,0.0%,
8/15/2017 17:10,133,0.0%,
8/16/2017 1:28,128,0.0%,
8/15/2017 17:11,128,0.0%,

Variable: region (Categorical)
Distinct count,37
Unique (%),0.0%
Missing (%),0.3%
Missing (n),8843

0,1
United States,2055882
Unknown,572767
Azerbaijan,100755
Other values (33),235124

Value,Count,Frequency (%),Unnamed: 3
United States,2055882,69.1%,
Unknown,572767,19.3%,
Azerbaijan,100755,3.4%,
United Arab Emirates,74908,2.5%,
Russian Federation,37637,1.3%,
Belarus,29619,1.0%,
Germany,27192,0.9%,
United Kingdom,18062,0.6%,
Italy,13494,0.5%,
Iraq,11219,0.4%,

Variable: retweet (Boolean)
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.4409

0,1
0,1662425
1,1310946

Value,Count,Frequency (%),Unnamed: 3
0,1662425,55.9%,
1,1310946,44.1%,

Variable: updates (Numeric)
Distinct count,97696
Unique (%),3.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,10498
Minimum,-1
Maximum,166113
Zeros (%),0.0%

0,1
Minimum,-1
5-th percentile,308
Q1,1787
Median,4333
Q3,12341
95-th percentile,39104
Maximum,166113
Range,166114
Interquartile range,10554

0,1
Standard deviation,17687
Coef of variation,1.6849
Kurtosis,33.376
Mean,10498
MAD,10215
Skewness,4.9007
Sum,31213128677
Variance,312840000
Memory size,22.7 MiB

Value,Count,Frequency (%),Unnamed: 3
3,765,0.0%,
5,739,0.0%,
6,734,0.0%,
4,729,0.0%,
7,715,0.0%,
8,700,0.0%,
9,694,0.0%,
2,688,0.0%,
11,663,0.0%,
10,662,0.0%,

Value,Count,Frequency (%),Unnamed: 3
-1,3,0.0%,
1,643,0.0%,
2,688,0.0%,
3,765,0.0%,
4,729,0.0%,

Value,Count,Frequency (%),Unnamed: 3
166109,1,0.0%,
166110,1,0.0%,
166111,1,0.0%,
166112,1,0.0%,
166113,1,0.0%,

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,post_type,account_type,new_june_2018,retweet,account_category
0,9.06e+17,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,10/1/2017 19:58,10/1/2017 19:59,1052,9636,253,,Right,0,0,RightTroll
1,9.06e+17,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,10/1/2017 22:43,10/1/2017 22:43,1054,9637,254,,Right,0,0,RightTroll
2,9.06e+17,10_GOP,Daughter of fallen Navy Sailor delivers powerf...,Unknown,English,10/1/2017 22:50,10/1/2017 22:51,1054,9637,255,RETWEET,Right,0,1,RightTroll
3,9.06e+17,10_GOP,JUST IN: President Trump dedicates Presidents ...,Unknown,English,10/1/2017 23:52,10/1/2017 23:52,1062,9642,256,,Right,0,0,RightTroll
4,9.06e+17,10_GOP,"19,000 RESPECTING our National Anthem! #StandF...",Unknown,English,10/1/2017 2:13,10/1/2017 2:13,1050,9645,246,RETWEET,Right,0,1,RightTroll


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2973371 entries, 0 to 37554
Data columns (total 15 columns):
external_author_id    float64
author                object
content               object
region                object
language              object
publish_date          object
harvested_date        object
following             int64
followers             int64
updates               int64
post_type             object
account_type          object
new_june_2018         int64
retweet               int64
account_category      object
dtypes: float64(1), int64(5), object(9)
memory usage: 363.0+ MB


## Observations from Inspection / Pandas_Profiling ProfileReport

- **Language to Analyze is in `Content`:**
    - Actual tweet contents. 
 
- **Classification/Analysis Thoughts:**
    - **Variables should be considered in 2 ways:**
        - First, the tweet contents. 
            - Use NLP to engineer features to feed into deep learning.
                - Sentiment analysis, named-entity frequency/types, most-similar words. 
        - Second, the tweet metadata. 
        
### Thoughts on specific features:
- `language`
    - There are 56 unique languages. 
    - About 2.13 million tweets are in English, about 624 K are in Russian, etc.
    - Note: for metadata, analyzing if an account posts in more than 1 language may be a good predictor. 
- `followers`/`following`
    - **following** could be informative if the goal is to predict whether it's a troll tweet.
    - **followers** should be used (with retweets) if predicting retweets based on content. 
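
The multi-language idea noted under `language` could be turned into a metadata feature roughly as follows. This is a sketch on toy rows; `n_languages` and `multilingual` are illustrative names, not columns in the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'author': ['A', 'A', 'A', 'B', 'B', 'C'],
    'language': ['English', 'Russian', 'English', 'English', 'English', 'German'],
})

# Number of distinct languages each account has posted in.
n_languages = toy.groupby('author')['language'].nunique().rename('n_languages')

# Attach the per-account value to every tweet and flag multilingual accounts.
toy = toy.join(n_languages, on='author')
toy['multilingual'] = toy['n_languages'] > 1

print(toy.drop_duplicates('author')[['author', 'n_languages', 'multilingual']])
```

Because the feature is computed per account, it would need to be built before any non-English rows are dropped.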

### Questions to answer:
- [ ] Why are so many `post_type` values missing (55.9%)?
- [ ] How many tweets were written by a Russian troll account?
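
One way to approach the first question is to cross-tabulate `post_type` against `retweet`, keeping missing values as their own category. In the profile report the number of missing `post_type` values (1,662,425) equals the number of rows with `retweet` = 0, which suggests the field is simply blank for original tweets. The toy frame below only illustrates the check and is not real data.

```python
import pandas as pd

toy = pd.DataFrame({
    'post_type': [None, 'RETWEET', 'QUOTE_TWEET', None, 'RETWEET'],
    'retweet':   [0, 1, 1, 0, 1],
})

# Label missing post_type explicitly so it shows up as a row in the table.
post_type_filled = toy['post_type'].fillna('(Missing)')
table = pd.crosstab(post_type_filled, toy['retweet'])

print(table)
```

If every "(Missing)" row falls in the `retweet` = 0 column on the full data, the missing values are structural rather than a data-quality problem.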
    
### Scrubbing to Perform
- **Recast Columns:**
    - [ ] `publish_date` to datetime. 
- **Columns to Discard:**
    - [ ] `harvested_date` (we care about publish_date, if anything, time-wise)
    - [ ] `language`: remove all non-English tweets and drop column
    - [ ] `new_june_2018`

### Reducing Targeted Tweets to Language=English and Retweet=0 Only

- Since the goal is to use NLP to detect which tweets came from Russian trolls, we will only analyze English-language tweets that were originally written by a known Russian troll account (`retweet` = 0), not content the account merely retweeted.

In [7]:
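# Keep only English-language rows that are not retweets
# (.copy() gives an independent frame, so later column assignments do not warn)
df = df.loc[(df.language == 'English') & (df.retweet == 0)].copy()
# df.info()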

In [8]:
# Drop columns that are not needed for the analysis
cols_to_drop = ['harvested_date', 'new_june_2018']
df = df.drop(columns=cols_to_drop)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1272848 entries, 0 to 37431
Data columns (total 13 columns):
external_author_id    1272848 non-null float64
author                1272848 non-null object
content               1272847 non-null object
region                1271703 non-null object
language              1272848 non-null object
publish_date          1272848 non-null object
following             1272848 non-null int64
followers             1272848 non-null int64
updates               1272848 non-null int64
post_type             0 non-null object
account_type          1272502 non-null object
retweet               1272848 non-null int64
account_category      1272848 non-null object
dtypes: float64(1), int64(4), object(8)
memory usage: 136.0+ MB


___
# Save/Load and Resume

In [10]:
save_or_load = input('Would you like to "save" or "load" dataframe?\n("save","load","no"):').lower()

if save_or_load == 'save':
    # Save the filtered dataframe (the index is written as the first CSV column)
    df.to_csv('russian_troll_tweets_eng_only_date_pub_index.csv')

if save_or_load == 'load':
    import bs_ds as bs
    from bs_ds.imports import *
    # Load the CSV, restoring the saved index from the first column
    df = pd.read_csv('russian_troll_tweets_eng_only_date_pub_index.csv', index_col=0)

Would you like to "save" or "load" dataframe?
("save","load","no"):save


### Recasting Publish date as datetime column (date_published)

In [9]:
# Recast publish_date as datetime (new column date_published) and make it the index
df['date_published'] = pd.to_datetime(df['publish_date'])
df.set_index('date_published', inplace=True)
print('Changed index to datetime "date_published".')

Changed index to datetime "date_published".


In [10]:
print(df.columns)

Index(['external_author_id', 'author', 'content', 'region', 'language',
       'publish_date', 'following', 'followers', 'updates', 'post_type',
       'account_type', 'retweet', 'account_category'],
      dtype='object')


In [11]:
# Convert publish_date to datetime
# df['date_published'] = pd.to_datetime(df.publish_date)
print(f'Tweet dates from {np.min(df.index)}  to  {np.max(df.index)}')

Tweet dates from 2012-02-06 20:24:00  to  2018-05-30 20:58:00


### Processing Columns

In [13]:
# Drop un-needed columns
# cols_to_drop = ['publish_date','language']
# for col in cols_to_drop:

#     df.drop(col, axis=1, inplace=True)
#     print(f'Dropped {col}.')


# # Recast categorical columns
# cols_to_cats = ['region','post_type','account_type','account_category']
# for col in cols_to_cats:

#     df[col] = df[col].astype('category')
#     print(f'Converted {col} to category.')


# # Drop problematic nan in 'content'
# df.dropna(subset=['content'],inplace=True) # Dropping the 1 null value 

# df.head()

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1272848 entries, 2017-10-01 19:58:00 to 2015-08-13 11:19:00
Data columns (total 13 columns):
external_author_id    1272848 non-null float64
author                1272848 non-null object
content               1272847 non-null object
region                1271703 non-null object
language              1272848 non-null object
publish_date          1272848 non-null object
following             1272848 non-null int64
followers             1272848 non-null int64
updates               1272848 non-null int64
post_type             0 non-null object
account_type          1272502 non-null object
retweet               1272848 non-null int64
account_category      1272848 non-null object
dtypes: float64(1), int64(4), object(8)
memory usage: 136.0+ MB


# Thoughts on My Search Strategy

**My Twitter API Link:**<br>
https://api.twitter.com/1.1/tweets/search/fullarchive/search.json


**Inspect Data to get search parameters:**
- [X] Get the date range for the English tweets in the original dataset<br>
    - **Tweet date range:**
        - **2012-02-06** to **2018-05-30**

- [X] Get a list of the hashtags (and their frequencies) from the dataframe

**Determine the most feasible and balanced way of extracting control tweets**
- [ ] How many of each tag / @ should I try to extract?
- [ ] What are the limitations of the API that will be a roadblock to getting as many tweets as desired?
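
For the first open question, one simple option is to split a fixed budget of control tweets across search terms in proportion to how often each term appears in the troll data. The sketch below assumes that approach; `allocate_quota`, the example counts, and the budget of 1,000 are all hypothetical.

```python
def allocate_quota(term_counts, total_budget):
    """Split total_budget across terms in proportion to their counts.

    Uses floor division, then hands leftover units to the most frequent terms.
    """
    total = sum(term_counts.values())
    quota = {term: total_budget * count // total for term, count in term_counts.items()}

    leftover = total_budget - sum(quota.values())
    for term in sorted(term_counts, key=term_counts.get, reverse=True)[:leftover]:
        quota[term] += 1
    return quota


# Made-up frequencies for three search terms.
example_counts = {'#news': 600, '#sports': 300, '@CNN': 100}
print(allocate_quota(example_counts, 1000))
```

A proportional split keeps the control set's tag mix close to the troll set's; an equal split per term would instead over-represent rare tags.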

In [17]:
# Inspect Data to get search parameters:
print(f'Tweet date range:\n {min(df.index)} to {max(df.index)}')
print(f'\nTotal days: {max(df.index)-min(df.index)}')

Tweet date range:
 2012-02-06 20:24:00 to 2018-05-30 20:58:00

Total days: 2305 days 00:34:00
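
Because the tweets span 2,305 days, a control search would likely need to be issued in smaller date windows. The sketch below cuts the range into fixed-length windows and formats the bounds as `YYYYMMDDhhmm` strings, which I believe is the timestamp format the premium search `fromDate`/`toDate` parameters expect. The 30-day window length and the `date_windows` helper are assumptions for illustration.

```python
from datetime import datetime, timedelta


def date_windows(start, end, days=30):
    """Yield (window_start, window_end) pairs that cover start..end."""
    step = timedelta(days=days)
    current = start
    while current < end:
        window_end = min(current + step, end)
        yield current, window_end
        current = window_end


start = datetime(2012, 2, 6, 20, 24)
end = datetime(2018, 5, 30, 20, 58)

windows = list(date_windows(start, end))
first_from, first_to = windows[0]
print(len(windows), first_from.strftime('%Y%m%d%H%M'), first_to.strftime('%Y%m%d%H%M'))
```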


## Determining Hashtags & @'s to search for

- Use regular expressions to extract the hashtags (#words) and @handles.
- Use the top X tags as search terms for the Twitter API.
    - The search below finds 832,208 hashtags and 673,442 @'s in the filtered tweets; these are total occurrences (repeats included), not unique values.
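
A small worked example of the extraction step, assuming patterns similar to the ones used in the next cell. It also shows one pitfall: with `\w*` a bare `#` or `@` counts as a match, whereas `\w+` requires at least one word character. The sample tweet is made up.

```python
import re

sample_tweet = "Breaking # update: #MAGA rally with @realDonaldTrump @ 5pm https://t.co/abc123"

loose_tags = re.findall(r'#\w*', sample_tweet)       # also matches a lone '#'
strict_tags = re.findall(r'#\w+', sample_tweet)      # needs at least one word character
strict_mentions = re.findall(r'@\w+', sample_tweet)  # skips the lone '@'

print(loose_tags)
print(strict_tags)
print(strict_mentions)
```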

In [20]:
# NEW: Make columns containing all hashtags, mentions, and links
import re
hashtags = re.compile(r'(#\w+)')  # \w+ (not \w*) so a bare '#' is not counted as a hashtag
df['hashtags'] = df['content'].map(lambda x: hashtags.findall(str(x)))

mentions = re.compile(r'(@\w+)')  # likewise skips a bare '@'
df['mentions'] = df['content'].map(lambda x: mentions.findall(str(x)))

urls = re.compile(r"(http[s]?://\w*\.\w*/+\w+)")
df['links'] = df['content'].map(lambda x: urls.findall(str(x)))

In [21]:
df.head()

date_published,external_author_id,author,content,region,language,publish_date,following,followers,updates,post_type,account_type,retweet,account_category,hastags,hashtags,mentions,links
2017-10-01 19:58:00,9.06e+17,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,10/1/2017 19:58,1052,9636,253,,Right,0,RightTroll,[],[],[@nedryun],[https://t.co/gh6g0D1oiC]
2017-10-01 22:43:00,9.06e+17,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,10/1/2017 22:43,1054,9637,254,,Right,0,RightTroll,[],[],[],[https://t.co/mLH1i30LZZ]
2017-10-01 23:52:00,9.06e+17,10_GOP,JUST IN: President Trump dedicates Presidents ...,Unknown,English,10/1/2017 23:52,1062,9642,256,,Right,0,RightTroll,[],[],[],[https://t.co/z9wVa4djAE]
2017-10-01 02:47:00,9.06e+17,10_GOP,"Dan Bongino: ""Nobody trolls liberals better th...",Unknown,English,10/1/2017 2:47,1050,9644,247,,Right,0,RightTroll,[],[],[],[https://t.co/AigV93aC8J]
2017-10-01 02:52:00,9.06e+17,10_GOP,'@SenatorMenendez @CarmenYulinCruz Doesn't mat...,Unknown,English,10/1/2017 2:52,1050,9644,249,,Right,0,RightTroll,[],[],"[@SenatorMenendez, @CarmenYulinCruz]",[]


In [None]:
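# Each row holds a list of tags, so flatten the lists before counting (Series.explode needs pandas >= 0.25)
df['hashtags'].explode().value_counts()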

# BOOKMARK 06/03

### def get_tags_ats

In [18]:
# Define get_tags_ats to accept a list of text entries and return all found tags and ats as 2 series/lists
def get_tags_ats(text_to_search, exp_tag=r'(#\w*)', exp_at=r'(@\w*)', output='series', show_counts=False):
    """Accepts a list of text entries to search, a regex for tags, and a regex for @'s.
    Joins all entries in the list of text and then runs re.findall() for both expressions.
    Returns a series (or list) of found_tags and a series (or list) of found_ats."""
    import re

    # Create a single long string from all entries (str() guards against non-string values such as NaN)
    text_to_search_combined = ' '.join(str(text) for text in text_to_search)

    found_tags = re.findall(exp_tag, text_to_search_combined)
    found_ats = re.findall(exp_at, text_to_search_combined)

    if output.lower() == 'series':
        found_tags = pd.Series(found_tags, name='tags')
        found_ats = pd.Series(found_ats, name='ats')

        if show_counts:
            print(f'\t{found_tags.name}:\n{found_tags.value_counts()} \n\n\t{found_ats.name}:\n{found_ats.value_counts()}')

    if output.lower() != 'series' and show_counts:
        raise Exception('output must be set to "series" in order to show_counts')

    return found_tags, found_ats

In [19]:
# Need to get a list of hash tags: collect every tweet's text into a list.
text_to_search_list = df['content'].tolist()

text_to_search_list[:2]

['"We have a sitting Democrat US Senator on trial for corruption and you\'ve barely heard a peep from the mainstream media." ~ @nedryun https://t.co/gh6g0D1oiC',
 'Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ']

In [20]:
# Get all tweet tags and @'s from text_to_search_list
tweet_tags, tweet_ats = get_tags_ats(text_to_search_list, show_counts=False)

# len() counts every occurrence (repeats included), not unique values
print(f"There were {len(tweet_tags)} hashtags and {len(tweet_ats)} @'s in total\n")

# Create a dataframe with top_tags
df_top_tags = pd.DataFrame(tweet_tags.value_counts()[:40])#,'\n')
df_top_tags['% Total'] = (df_top_tags['tags']/len(tweet_tags)*100)

# Create a dataframe with top_ats
df_top_ats = pd.DataFrame(tweet_ats.value_counts()[:40])
df_top_ats['% Total'] = (df_top_ats['ats']/len(tweet_ats)*100)

# Display top tags and ats
# bs.display_side_by_side(df_top_tags,df_top_ats)

There were 832208 hashtags and 673442 @'s in total



### Notes on Top Tags and Ats:


In [21]:
# Choose list of top tags to use in search
list_top_30_tags = df_top_tags.index[:30]
list_top_30_tags

Index(['#news', '#sports', '#politics', '#world', '#local', '#TopNews',
       '#health', '#business', '#BlackLivesMatter', '#tech', '#entertainment',
       '#MAGA', '#top', '#Cleveland', '#crime', '#TopVideo', '#environment',
       '#PJNET', '#mar', '#FAKENEWS', '#Miami', '#tcot', '#IslamKills',
       '#topl', '#SanJose', '#life', '#breaking', '#ISIS', '#DemnDebate',
       '#KochFarms'],
      dtype='object')

In [22]:
# Choose list of top @'s to use in search
list_top_30_ats = df_top_ats.index[:30]
list_top_30_ats

Index(['@midnight', '@realDonaldTrump', '@WarfareWW', '@CNN',
       '@HillaryClinton', '@POTUS', '@CNNPolitics', '@FoxNews', '@mashable',
       '@YouTube', '@CNNSitRoom', '@AC360', '@VanJones68', '@CNNI',
       '@TheLeadCNN', '@DonLemon', '@JakeTapper', '@AnaNavarro',
       '@BrianStelter', '@AndersonCooper', '@WolfBlitzer', '@truthfeednews',
       '@Jenn_Abrams', '@washingtonpost', '@nytimes', '@jstines3', '@deray',
       '@Acosta', '@', '@todayinsyria'],
      dtype='object')

- The most common tags include some very generic categories that may not be helpful in extracting control tweets.
    - ~~Exclude: '#news','#sports','#politics','#world','#local','#TopNews','#health','#business','#tech',~~
    - On second thought, keeping them is entirely appropriate, since these tags are what would appear in the wild.
    - Additionally, using a larger number of them (like 30) starts to provide more targeted hashtags.<br><br>
  
- **The most common @'s are much more revealing and helpful in narrowing the focus of the results.**

___

# Using the Twitter Search API to Extract Control Tweets

- [x] The required API keys are saved in the main folder that contains this repo.
- [x] Check the [Premium account docs for search syntax](https://developer.twitter.com/en/docs/tweets/search/guides/premium-operators.html)
- [x] [Check this article for using Tweepy for most efficient twitter api extraction](https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)

**LINK TO PREMIUM SEARCH API GUIDE**<br>
https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search

**Available search operators**
- Premium search API supports rules with up to 1,024 characters. The Search Tweets APIs support the premium operators listed below. See our Premium operators guide for more details.

- The base URI for the premium search API is https://api.twitter.com/1.1/tweets/search/.

**Matching on Tweet contents:**
- keyword , "quoted phrase" , # , @, url , lang
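
As a sketch of how the operators above might be combined, the helper below joins search terms with `OR`, appends a `lang:en` filter, and starts a new rule whenever the next term would push the rule past the 1,024-character limit. `build_rules` is a hypothetical helper, and the exact operator support should be checked against the premium operators guide for the account tier in use.

```python
def build_rules(terms, max_len=1024, suffix=' lang:en'):
    """Group terms into '(a OR b OR ...) lang:en' rules no longer than max_len."""
    rules, current = [], []
    for term in terms:
        candidate = '(' + ' OR '.join(current + [term]) + ')' + suffix
        if current and len(candidate) > max_len:
            # Close the current rule and start a new one with this term.
            rules.append('(' + ' OR '.join(current) + ')' + suffix)
            current = [term]
        else:
            current.append(term)
    if current:
        rules.append('(' + ' OR '.join(current) + ')' + suffix)
    return rules


top_terms = ['#news', '#sports', '#politics', '@CNN', '@FoxNews']
print(build_rules(top_terms))
print(build_rules(top_terms, max_len=30))
```

The second call uses an artificially small limit only to show the splitting behavior on a short list.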


## Using tweepy to access twitter API

- [Helpful tutorial on _most efficient_ way to access twitter API](https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)

### def connect_twitter_api, def search_twitter_api

In [23]:
# Initialize Tweepy with authorization keys
def connect_twitter_api(api_key, api_secret_key):
    import tweepy, sys
    auth = tweepy.AppAuthHandler(api_key, api_secret_key)
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    if (not api):
        print("Can't authenticate.")
        sys.exit(-1)
    return api

In [24]:
def search_twitter_api(api_object, searchQuery, maxTweets, fName, tweetsPerQry=100, max_id=0, sinceId=None):
    """Take an authenticated tweepy api_object, a search query, the max # of tweets to retrieve, and a destination filename.
    Calls tweepy's api.search for searchQuery until maxTweets is reached, appending harvested tweets to fName as one JSON object per line."""
    import sys, jsonpickle, os
    import tweepy  # needed for the tweepy.TweepError handler below
    api = api_object
    tweetCount = 0
    print(f'Downloading max{maxTweets} for {searchQuery}...')
    with open(fName, 'a+') as f:
        while tweetCount < maxTweets:

            try:
                if (max_id <=0):
                    if (not sinceId):
                        new_tweets = api.search(q=searchQuery, count=tweetsPerQry, tweet_mode='extended')
                    else:
                        new_tweets = api.search(q=searchQuery, count=tweetsPerQry, since_id=sinceId, tweet_mode='extended')

                else:
                    if (not sinceId):
                        new_tweets = api.search(q=searchQuery, count=tweetsPerQry, max_id=str(max_id-1), tweet_mode='extended')
                    else:
                        new_tweets = api.search(q=searchQuery, count=tweetsPerQry, max_id=str(max_id-1),since_id=sinceId, tweet_mode='extended')

                if not new_tweets:
                    print('No more tweets found')
                    break

                for tweet in new_tweets:
                    f.write(jsonpickle.encode(tweet._json, unpicklable=False)+'\n')

                tweetCount+=len(new_tweets)

                print("Downloaded {0} tweets".format(tweetCount))
                max_id = new_tweets[-1].id

            except tweepy.TweepError as e:
                # Just exit if any error
                print("some error : " + str(e))
                break
    print ("Downloaded {0} tweets, Saved to {1}\n".format(tweetCount, fName))
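
The `max_id` bookkeeping in `search_twitter_api` is easier to follow in isolation. The following is a minimal sketch of the same backwards-pagination idea, with no network access: `paginate_by_max_id`, `fake_fetch`, and `FAKE_TWEETS` are hypothetical stand-ins, and `fetch_page` plays the role of `api.search`.

```python
def paginate_by_max_id(fetch_page, max_items):
    """Collect up to max_items results, walking backwards through ids.

    fetch_page(max_id) is a hypothetical callable that returns a list of
    dicts (newest first) whose 'id' is <= max_id, or the newest results
    when max_id is None.
    """
    collected = []
    max_id = None
    while len(collected) < max_items:
        page = fetch_page(max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Ask only for items strictly older than the last one seen.
        max_id = page[-1]['id'] - 1
    return collected[:max_items]


# Stand-in for the API: 10 fake tweets with ids 10..1, served 4 per page.
FAKE_TWEETS = [{'id': i} for i in range(10, 0, -1)]


def fake_fetch(max_id, page_size=4):
    eligible = [t for t in FAKE_TWEETS if max_id is None or t['id'] <= max_id]
    return eligible[:page_size]


ids = [t['id'] for t in paginate_by_max_id(fake_fetch, max_items=7)]
print(ids)  # [10, 9, 8, 7, 6, 5, 4]
```

Unlike `search_twitter_api`, this sketch truncates the final page, so it never returns more than `max_items`; the original keeps every tweet in the last page and can overshoot `maxTweets` slightly.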

## Connect to Twitter and Harvest Tweets

### Making lists of tags and ats to query

In [None]:
print(df.index)

In [25]:
# df_top_ats.ats[:20], df_top_tags.tags[:20]

In [26]:
# Pair each @ and each # with its count, building the query tuples that feed the search cells below
query_ats = tuple(zip(df_top_ats.index, df_top_ats['ats']))
query_tags = tuple(zip(df_top_tags.index, df_top_tags['tags']))

# Calculate how many tweets are represented by the top tags and the top @'s
sum_top_tweet_tags = df_top_tags['tags'].sum()
sum_top_tweet_ats = df_top_ats['ats'].sum()
print(f"Sum of top tags = {sum_top_tweet_tags}\nSum of top @'s = {sum_top_tweet_ats}")

Sum of top tags = 422494
Sum of top @'s = 32925


In [30]:
print(query_ats[:10],'\n')
print(query_tags[:10])

(('@midnight', 6691), ('@realDonaldTrump', 3532), ('@WarfareWW', 1529), ('@CNN', 1471), ('@HillaryClinton', 1424), ('@POTUS', 1035), ('@CNNPolitics', 948), ('@FoxNews', 930), ('@mashable', 740), ('@YouTube', 680)) 

(('#news', 118624), ('#sports', 45544), ('#politics', 37452), ('#world', 27077), ('#local', 23130), ('#TopNews', 14621), ('#health', 10328), ('#business', 9558), ('#BlackLivesMatter', 8252), ('#tech', 7836))


In [87]:
len(query_ats)

40

In [86]:
np.sum([x[1] for x in query_ats])

32925
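
Since each `(handle, count)` pair in `query_ats` is later used as a per-query download quota, a smaller pilot harvest could keep roughly the same proportions by scaling the counts. This is a sketch under that assumption; `scale_quotas` is a hypothetical helper and the three pairs are copied from the printed output above.

```python
def scale_quotas(pairs, target_total):
    """Scale (name, count) pairs so the counts sum to roughly target_total."""
    total = sum(count for _, count in pairs)
    # Keep at least one tweet per query so that no handle is dropped.
    return [(name, max(1, round(count * target_total / total)))
            for name, count in pairs]


sample = (('@midnight', 6691), ('@realDonaldTrump', 3532), ('@WarfareWW', 1529))
for name, quota in scale_quotas(sample, target_total=1000):
    print(name, quota)
```

Because of rounding, the scaled quotas may not sum to exactly `target_total`.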

In [31]:
# Inspect Data to get search parameters:
print(f'Tweet date range:\n {min(df.index)} to {max(df.index)}')
print(f'\nTotal days: {max(df.index)-min(df.index)}')

Tweet date range:
 2012-02-06 20:24:00 to 2018-05-30 20:58:00

Total days: 2305 days 00:34:00


### Reconsidering a More Thoughtful Approach to Top-Tags/@'s to Retrieve
- Using the most popular tags becomes a problem when searching the historical log of tweets.
- I am unable to query Twitter for specific dates, which means that I would have to page backwards through the results all the way to 05-30-2018 before saving any tweets.
    - **The API limitations mean that this will take MUCH longer for the MOST COMMON hashtags.**

- **New decision: use only @'s to determine the body of tweets to use as a control.**

In [27]:
PAUSE

NameError: name 'PAUSE' is not defined

### Connecting to twitter api and searching for lists of queries

In [85]:
# Import API keys from text files (so they are not displayed here or stored in the repo)
with open('../consumer_API_key.txt','r') as f:
    api_key = f.read().strip()  # strip any trailing newline from the key file
with open('../consumer_API_secret_key.txt','r') as f:
    api_secret_key = f.read().strip()

#### Test searches

In [33]:
# Manually connecting to API and doing test searches. 
import tweepy, sys
auth = tweepy.AppAuthHandler(api_key, api_secret_key)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

if (not api):
    print("Can't authenticate.")
    sys.exit(-1)

In [34]:
# Search for a batch of test results
searchQuery='#politics'
tweetsPerQry=100

new_tweets = api.search(q=searchQuery, count=tweetsPerQry, tweet_mode='extended')
type(new_tweets)

In [83]:
# Display the time range of new_tweets so I can define a time range to test
test_dates = [x.created_at for x in new_tweets]
print(f'Range:{min(test_dates)} to {max(test_dates)}')
test_dates[0], test_dates[-1]

Range:2019-06-02 19:53:06 to 2019-06-02 20:05:48


(datetime.datetime(2019, 6, 2, 20, 5, 48),
 datetime.datetime(2019, 6, 2, 19, 53, 6))

In [72]:
from datetime import datetime
end_time = datetime(2019,6,2,20,0,0)
end_time

datetime.datetime(2019, 6, 2, 20, 0)

In [75]:
## DEFINING A NEW FUNCTION TO EXAMINE THE NEW_TWEETS OUTPUTS
def check_tweet_daterange(new_tweets,timerange_begin,timerange_end,verbose=0):
    """Return the indices of the tweets in a tweepy SearchResults object whose
    created_at falls strictly between timerange_begin and timerange_end."""
    
    time_start = timerange_begin
    time_end = timerange_end
    
    # Check each tweet's created_at against the time range.
    idx_keep_tweets = []
    for i,tweet in enumerate(new_tweets):
        if (tweet.created_at > time_start) and (tweet.created_at < time_end):
            idx_keep_tweets.append(i)
            if verbose>0:
                print(f'tweet({i}) kept: {tweet.created_at}')
    return idx_keep_tweets
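
To show how the date-range check behaves at its boundaries, here is a small self-contained sketch. `FakeTweet` and `tweets_in_daterange` are hypothetical stand-ins (only `created_at` is modelled), and the function returns the tweets themselves rather than their indices.

```python
from collections import namedtuple
from datetime import datetime

# Minimal stand-in for a tweepy Status; only created_at is needed here.
FakeTweet = namedtuple('FakeTweet', ['id', 'created_at'])


def tweets_in_daterange(tweets, begin, end):
    """Return the tweets whose created_at lies strictly between begin and end."""
    return [t for t in tweets if begin < t.created_at < end]


tweets = [
    FakeTweet(3, datetime(2019, 6, 2, 20, 5, 48)),
    FakeTweet(2, datetime(2019, 6, 2, 19, 58, 0)),
    FakeTweet(1, datetime(2019, 6, 2, 19, 53, 6)),
]
kept = tweets_in_daterange(tweets,
                           begin=datetime(2019, 6, 2, 19, 53, 6),
                           end=datetime(2019, 6, 2, 20, 0, 0))
print([t.id for t in kept])  # [2]
```

The strict inequalities mirror `check_tweet_daterange`: a tweet stamped exactly at either endpoint is excluded, which is why tweet 1 is dropped here.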

In [58]:
# Determining search criteria to limit twitter results to
latest_date = max(df.index) # Get latest date from troll tweets
earliest_date = min(df.index) # Get the earliest date from troll tweets

# Convert pandas timestamps to datetime object for tweet results
latest_datetime = latest_date.to_pydatetime()
earliest_datetime = earliest_date.to_pydatetime()

In [88]:
# def search_twitter_api_for_timerange(api_object, searchQuery, timerange_start, timerange_end, maxTweets, fName, tweetsPerQry=100, max_id=0, sinceId=None):
#     """Take an authenticated tweepy api_object, a search queary, a datetime object for earliest date and latest date,
#     max# of tweets to retreive, a desintation filename.
#     Uses tweept.api.search for the searchQuery until maxTweets is reached, saved harvest tweets to fName."""
#     import sys, jsonpickle, os
#     api = api_object
#     tweetCount = 0
#     print(f'Downloading max{maxTweets} for {searchQuery}...')
#     with open(fName, 'a+') as f:
#         while tweetCount < maxTweets:

#             try:
#                 if (max_id <=0):
#                     if (not sinceId):
#                         new_tweets = api.search(q=searchQuery, count=tweetsPerQry, tweet_mode='extended')
#                     else:
#                         new_tweets = api.search(q=searchQuery, count=tweetsPerQry, since_id=sinceId, tweet_mode='extended')

#                 else:
#                     if (not sinceId):
#                         new_tweets = api.search(q=searchQuery, count=tweetsPerQry, max_id=str(max_id-1), tweet_mode='extended')
#                     else:
#                         new_tweets = api.search(q=searchQuery, count=tweetsPerQry, max_id=str(max_id-1),since_id=sinceId, tweet_mode='extended')

#                 if not new_tweets:
#                     print('No more tweets found')
#                     break
                
#                 max_id = new_tweets[-1].id
#                 # Insert check for timerange
#                 new_tweets_idx_keep = check_tweet_daterange(new_tweets,test_dates[-1],end_time)
#                 kept_new_tweets = [new_tweets[i] for i in new_tweets_idx_keep]

                    
#                 for tweet in kept_new_tweets:
#                     f.write(jsonpickle.encode(tweet._json, unpicklable=False)+'\n')

#                 tweetCount+=len(kept_new_tweets)

#                 print("Downloaded {0} tweets".format(tweetCount))

#             except tweepy.TweepError as e:
#                 # Just exit if any error
#                 print("some error : " + str(e))
#                 break
#     print ("Downloaded {0} tweets, Saved to {1}\n".format(tweetCount, fName))

In [89]:
# # Get index of tweets in time range
# new_tweets_idx_keep = check_tweet_daterange(new_tweets,test_dates[-1],end_time)

# kept_new_tweets = [new_tweets[i] for i in new_tweets_idx_keep]
# kept_new_tweets[0]

#### Automated Searches:

In [None]:
pause

In [None]:
api = connect_twitter_api(api_key,api_secret_key)

In [90]:
# Extract tweets for top @'s, while matching the distribution of top @'s

final_query_list = query_ats
filename = 'tweets_for_top40_ats.txt'

for q in final_query_list:
    searchQuery = q[0]
    maxTweets = q[1]
    print(f'Query={searchQuery}, max={maxTweets}')
    search_twitter_api(api, searchQuery, maxTweets, fName=filename)

Query=@midnight, max=6691
Downloading max6691 for @midnight...
Downloaded 91 tweets
Downloaded 173 tweets
Downloaded 220 tweets
No more tweets found
Downloaded 220 tweets, Saved to tweets_for_top3_ats.txt

Query=@realDonaldTrump, max=3532
Downloading max3532 for @realDonaldTrump...
Downloaded 71 tweets
Downloaded 150 tweets
Downloaded 225 tweets
Downloaded 301 tweets
Downloaded 380 tweets
Downloaded 461 tweets
Downloaded 543 tweets
Downloaded 632 tweets
Downloaded 715 tweets
Downloaded 794 tweets
Downloaded 873 tweets
Downloaded 953 tweets
Downloaded 1022 tweets
Downloaded 1097 tweets
Downloaded 1178 tweets
Downloaded 1258 tweets
Downloaded 1333 tweets
Downloaded 1411 tweets
Downloaded 1488 tweets
Downloaded 1559 tweets
Downloaded 1635 tweets
Downloaded 1712 tweets
Downloaded 1796 tweets
Downloaded 1871 tweets
Downloaded 1950 tweets
Downloaded 2027 tweets
Downloaded 2106 tweets
Downloaded 2179 tweets
Downloaded 2235 tweets
Downloaded 2304 tweets
Downloaded 2371 tweets
Downloaded 2461 t

Downloaded 79 tweets
Downloaded 153 tweets
Downloaded 222 tweets
Downloaded 297 tweets
Downloaded 379 tweets
Downloaded 463 tweets
Downloaded 463 tweets, Saved to tweets_for_top3_ats.txt

Query=@nytimes, max=444
Downloading max444 for @nytimes...
Downloaded 91 tweets
Downloaded 172 tweets
Downloaded 259 tweets
Downloaded 352 tweets
Downloaded 444 tweets
Downloaded 444 tweets, Saved to tweets_for_top3_ats.txt

Query=@jstines3, max=413
Downloading max413 for @jstines3...
Downloaded 69 tweets
Downloaded 142 tweets
Downloaded 210 tweets
Downloaded 268 tweets
Downloaded 333 tweets
Downloaded 395 tweets
Downloaded 446 tweets
Downloaded 446 tweets, Saved to tweets_for_top3_ats.txt

Query=@deray, max=380
Downloading max380 for @deray...
Downloaded 80 tweets
Downloaded 165 tweets
Downloaded 261 tweets
Downloaded 341 tweets
Downloaded 434 tweets
Downloaded 434 tweets, Saved to tweets_for_top3_ats.txt

Query=@Acosta, max=367
Downloading max367 for @Acosta...
Downloaded 58 tweets
Downloaded 109 tw

In [None]:
# # Extract tweets for top #
# final_query_list = query_tags[:3]
# filename = 'tweets_for_top3_tags.txt'

# for q in final_query_list:
#     searchQuery = q[0]
#     maxTweets = q[1]
#     print(f'Query={searchQuery}, max={maxTweets}')
#     search_twitter_api(api, searchQuery, maxTweets, fName=filename)

___

In [91]:
df_ats = pd.read_json('tweets_for_top40_ats.txt', lines=True)

In [94]:
df_ats.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user,withheld_in_countries
0,,,2019-06-02 18:34:59,"[0, 202]","{'hashtags': [{'indices': [48, 66], 'text': 'd...",,0,False,@realDonaldTrump it’s perfectly reasonable tha...,,...,,,,0,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'contributors_enabled': False, 'created_at': ...",
1,,,2019-06-02 18:34:59,"[0, 139]","{'hashtags': [], 'symbols': [], 'urls': [], 'u...",,0,False,RT @BelkissObadia: BREAKING NEWS: \n\n@realDon...,,...,,,,950,False,"{'contributors': None, 'coordinates': None, 'c...","<a href=""http://twitter.com/download/android"" ...",False,"{'contributors_enabled': False, 'created_at': ...",
2,,,2019-06-02 18:34:59,"[17, 207]","{'hashtags': [], 'symbols': [], 'urls': [], 'u...",,0,False,@realDonaldTrump I thought you were supposed t...,,...,,,,0,False,,"<a href=""http://twitter.com/download/android"" ...",False,"{'contributors_enabled': False, 'created_at': ...",
3,,,2019-06-02 18:34:59,"[0, 139]","{'hashtags': [], 'symbols': [], 'urls': [], 'u...",,0,False,RT @realDonaldTrump: Mexico is sending a big d...,,...,,,,3128,False,"{'contributors': None, 'coordinates': None, 'c...","<a href=""http://twitter.com/#!/download/ipad"" ...",False,"{'contributors_enabled': False, 'created_at': ...",
4,,,2019-06-02 18:34:59,"[0, 139]","{'hashtags': [], 'symbols': [], 'urls': [], 'u...",,0,False,RT @realDonaldTrump: Mexico is sending a big d...,,...,,,,3128,False,"{'contributors': None, 'coordinates': None, 'c...","<a href=""http://twitter.com/download/android"" ...",False,"{'contributors_enabled': False, 'created_at': ...",


In [96]:
df_ats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44127 entries, 0 to 44126
Data columns (total 32 columns):
contributors                 0 non-null float64
coordinates                  6 non-null object
created_at                   44127 non-null datetime64[ns]
display_text_range           44127 non-null object
entities                     44127 non-null object
extended_entities            4875 non-null object
favorite_count               44127 non-null int64
favorited                    44127 non-null bool
full_text                    44127 non-null object
geo                          6 non-null object
id                           44127 non-null int64
id_str                       44127 non-null int64
in_reply_to_screen_name      20622 non-null object
in_reply_to_status_id        19352 non-null float64
in_reply_to_status_id_str    19352 non-null float64
in_reply_to_user_id          20622 non-null float64
in_reply_to_user_id_str      20622 non-null float64
is_quote_status              

In [128]:

# df_export['external_author_id'] = df_ats.loc[index=['user']['id']
# df_export['author'] = df_ats['user']['screen_name']
# df_export['content'] = df_ats['full_text']

# df_export['region'] = df_ats['user']['location']

# df_export['following'] = df_ats['user']['following']
# df_export['followers'] = df_ats['user']['followers_count']
# df_export['updates'] = np.nan
# df_export['post_type'] = 'control'
# df_export['account_type'] = 'control'
# df_export['retweet'] = df_ats['retweeted']
# df_export['account_category'] = 'control' 
# df_export['publish_date'] = df_ats['created_at']
# df_export['date_published'] = pd.to_datetime(df_ats['created_at'])
# df_export.set_index('date_published',inplace=True)
# df_export['language'] = df_ats['lang']

In [12]:
print(df.columns)

Index(['external_author_id', 'author', 'content', 'region', 'language',
       'publish_date', 'following', 'followers', 'updates', 'post_type',
       'account_type', 'retweet', 'account_category'],
      dtype='object')


### Notes on Making New Extracted Tweets Match Russian Troll Tweet Database

- Columns to be renamed/reformatted to match troll tweets:
    - created_at -> 'date_published' -> index
    - full_text -> 'content'
    - df['user']:
        - ['friends_count'] -> 'following'
        - ['followers_count'] -> 'followers'
        - ['screen_name'] -> 'author'
        - ['id'] -> 'external_author_id'
- Columns missing from original troll tweets (to be removed):
    - coordinates, favorited, favorite_count, display_text_range, withheld_in_countries
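
The column mapping in these notes can be sketched as one function per tweet. This is an illustration rather than the notebook's actual conversion code: `tweet_to_row` is a hypothetical helper, the example record uses made-up text and a placeholder screen name, and it assumes `friends_count` (the number of accounts a user follows in the v1.1 user object) is the right source for `following`.

```python
def tweet_to_row(tweet):
    """Map one raw tweet dict (Twitter API v1.1 JSON) to troll-tweet columns.

    Hypothetical helper; the 'control' label is an assumed placeholder value.
    """
    user = tweet['user']
    return {
        'external_author_id': user['id'],
        'author': user['screen_name'],
        'content': tweet['full_text'],
        'region': user.get('location', ''),
        'following': user.get('friends_count'),    # accounts this user follows
        'followers': user.get('followers_count'),
        'account_category': 'control',
        'publish_date': tweet['created_at'],
        'language': tweet.get('lang'),
    }


# Illustrative record; the text and screen name are placeholders.
example = {
    'created_at': 'Sun Jun 02 18:34:59 +0000 2019',
    'full_text': 'example text',
    'lang': 'en',
    'user': {'id': 2959156661, 'screen_name': 'example_user',
             'location': '', 'followers_count': 497, 'friends_count': 653},
}
row = tweet_to_row(example)
print(row['following'], row['followers'])  # 653 497
```

A list of such rows can be passed to `pd.DataFrame` in one call, which is typically much faster than filling a frame cell by cell.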
    

In [116]:
df_ats['lang'][1]

'en'

In [139]:
df_ats.user[10]

{'contributors_enabled': False,
 'created_at': 'Sat Jan 03 21:34:24 +0000 2015',
 'default_profile': True,
 'default_profile_image': False,
 'description': 'Married Trump supporter. WWG1WGA, Maga,',
 'entities': {'description': {'urls': []}},
 'favourites_count': 28851,
 'follow_request_sent': None,
 'followers_count': 497,
 'following': None,
 'friends_count': 653,
 'geo_enabled': False,
 'has_extended_profile': False,
 'id': 2959156661,
 'id_str': '2959156661',
 'is_translation_enabled': False,
 'is_translator': False,
 'lang': 'en',
 'listed_count': 0,
 'location': '',
 'name': 'lorna deephouse',
 'notifications': None,
 'profile_background_color': 'C0DEED',
 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
 'profile_background_tile': False,
 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/2959156661/1533029109',
 'profile_image_url': 'http://pb

In [97]:
df_ats.entities[0]

{'hashtags': [{'indices': [48, 66], 'text': 'draftdodgingdonny'}],
 'symbols': [],
 'urls': [],
 'user_mentions': [{'id': 25073877,
   'id_str': '25073877',
   'indices': [0, 16],
   'name': 'Donald J. Trump',
   'screen_name': 'realDonaldTrump'}]}

In [111]:
print(df_ats['user'][1]['location'])

Bradenton, FL


In [103]:
print(df_ats['user'][0]['id'])
print(df_ats['user'][0]['screen_name'])
print(df_ats['user'][0]['followers_count'])
print(df_ats['user'][0]['following'])

1114658021785919488
Draftdodgingdon
0
None


In [129]:
df_test=pd.DataFrame()

In [136]:
idx_row = df_ats.index[2]
curr_row = df_ats.loc[df_ats.index==idx_row]
curr_author = curr_row['user']
curr_author

Int64Index([2], dtype='int64')

In [140]:
# df_test.loc[column==['new_col'],index==idx_row] = curr_row['user']
list(df.columns)

['external_author_id',
 'author',
 'content',
 'region',
 'following',
 'followers',
 'updates',
 'post_type',
 'account_type',
 'retweet',
 'account_category']

In [141]:
df_columns_list =['external_author_id', 'author', 'content', 'region', 'following', 'followers', 'updates', 'post_type',
 'account_type', 'retweet', 'account_category']
df_export = pd.DataFrame(columns=df_columns_list)

In [145]:
df_export.loc[0,'external_author_id']='test'
df_export

Unnamed: 0,external_author_id,author,content,region,following,followers,updates,post_type,account_type,retweet,account_category
0,test,,,,,,,,,,


In [149]:
curr_author[0]

{'contributors_enabled': False,
 'created_at': 'Sat Apr 06 22:35:56 +0000 2019',
 'default_profile': True,
 'default_profile_image': True,
 'description': 'Trump Hater',
 'entities': {'description': {'urls': []}},
 'favourites_count': 0,
 'follow_request_sent': None,
 'followers_count': 0,
 'following': None,
 'friends_count': 0,
 'geo_enabled': False,
 'has_extended_profile': False,
 'id': 1114658021785919488,
 'id_str': '1114658021785919488',
 'is_translation_enabled': False,
 'is_translator': False,
 'lang': 'en',
 'listed_count': 0,
 'location': '',
 'name': 'Jimmy',
 'notifications': None,
 'profile_background_color': 'F5F8FA',
 'profile_background_image_url': None,
 'profile_background_image_url_https': None,
 'profile_background_tile': False,
 'profile_image_url': 'http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png',
 'profile_image_url_https': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png',
 'profile_link_color': '1DA

In [158]:
full_text=curr_row['full_text']
full_text

0    @realDonaldTrump it’s perfectly reasonable tha...
Name: full_text, dtype: object

In [165]:
curr_row['user']
row

1

In [166]:
df_export=pd.DataFrame()
df_columns_list =['external_author_id', 'author', 'content', 'region', 'following', 'followers', 'updates', 'post_type',
 'account_type', 'retweet', 'account_category']

df_export = pd.DataFrame(columns=df_columns_list)

for row in df_ats.index:
    
    curr_row = df_ats.loc[df_ats.index==row]
    curr_author = curr_row['user'][row]
    external_author_id = curr_author['id']
    author =  curr_author['screen_name']
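    # NOTE: 'following' is None in these search results; curr_author['friends_count']
    # holds the number of accounts the user follows.
    following = curr_author['following']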
    followers = curr_author['followers_count']
    region = curr_author['location']
    full_text = curr_row['full_text'][row]
    
    df_export.loc[row, 'external_author_id'] = external_author_id
    df_export.loc[row, 'author'] = author
    df_export.loc[row, 'content'] = full_text
    df_export.loc[row, 'region'] = region
    df_export.loc[row, 'following'] = following
    df_export.loc[row, 'followers'] = followers
    df_export.loc[row, 'updates'] = np.nan
    df_export.loc[row, 'post_type'] = 'control'
    df_export.loc[row, 'account_type'] = 'control'
    df_export.loc[row, 'retweet'] = curr_row['retweeted'][row]
    df_export.loc[row, 'account_category'] = 'control' 
    df_export.loc[row, 'publish_date'] = curr_row['created_at'][row]
    df_export.loc[row, 'language'] = curr_row['lang'][row]
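
One caveat about the `retweet` column filled in above: in the v1.1 tweet object, `retweeted` reports whether the authenticating account retweeted the tweet, so it is `False` in the sample rows even when the text starts with "RT @". Retweets instead carry a `retweeted_status` payload. A minimal sketch of a check based on that field follows; `is_retweet` is a hypothetical helper and the two dicts are trimmed, made-up examples.

```python
def is_retweet(tweet):
    """Return True if a raw tweet dict carries a retweeted_status payload."""
    # isinstance also rejects None and the NaN pandas uses for missing values.
    return isinstance(tweet.get('retweeted_status'), dict)


original = {'full_text': '@realDonaldTrump ...', 'retweeted': False}
retweet = {'full_text': 'RT @realDonaldTrump: ...', 'retweeted': False,
           'retweeted_status': {'id': 1}}

print(is_retweet(original), is_retweet(retweet))  # False True
```

On the DataFrame itself, `df_ats['retweeted_status'].notna()` should give an equivalent flag, assuming missing payloads were loaded as NaN.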

In [167]:
df_export.head()

Unnamed: 0,external_author_id,author,content,region,following,followers,updates,post_type,account_type,retweet,account_category,publish_date,language
0,1.114658e+18,Draftdodgingdon,@realDonaldTrump it’s perfectly reasonable tha...,,,0.0,,control,control,False,control,2019-06-02 18:34:59,en
1,1.038424e+18,beapartofthemo1,RT @BelkissObadia: BREAKING NEWS: \n\n@realDon...,"Bradenton, FL",,1198.0,,control,control,False,control,2019-06-02 18:34:59,en
2,8.798349e+17,BarbHuber9,@realDonaldTrump I thought you were supposed t...,,,1099.0,,control,control,False,control,2019-06-02 18:34:59,en
3,4765020000.0,nonamehombre,RT @realDonaldTrump: Mexico is sending a big d...,,,69.0,,control,control,False,control,2019-06-02 18:34:59,en
4,8.244085e+17,letbuildthewall,RT @realDonaldTrump: Mexico is sending a big d...,,,18.0,,control,control,False,control,2019-06-02 18:34:59,en


In [168]:
df_export.to_csv('newly_extracted_control_tweets.csv')

In [169]:
from pandas_profiling import ProfileReport
ProfileReport(df_export)

0,1
Number of variables,13
Number of observations,44127
Total Missing (%),15.4%
Total size in memory,6.0 MiB
Average record size in memory,141.7 B

0,1
Numeric,2
Categorical,4
Boolean,0
Date,1
Text (Unique),0
Rejected,6
Unsupported,0

0,1
Constant value,control

0,1
Constant value,control

0,1
Distinct count,28607
Unique (%),64.8%
Missing (%),0.0%
Missing (n),0

0,1
Anthony18392503,65
Sharicabak,63
wildwillow65,47
Other values (28604),43952

Value,Count,Frequency (%),Unnamed: 3
Anthony18392503,65,0.1%,
Sharicabak,63,0.1%,
wildwillow65,47,0.1%,
jenniferclmn,46,0.1%,
MustBeTheMeds,46,0.1%,
Jedi4sss,38,0.1%,
insertgeekname,36,0.1%,
Nora88333625,33,0.1%,
BriMichelle75,32,0.1%,
donaldtrumptr01,31,0.1%,

0,1
Distinct count,26489
Unique (%),60.0%
Missing (%),0.0%
Missing (n),0

0,1
"RT @realDonaldTrump: Mexico is sending a big delegation to talk about the Border. Problem is, they’ve been “talking” for 25 years. We want…",1994
"RT @realDonaldTrump: Peggy Noonan, the simplistic writer for Trump Haters all, is stuck in the past glory of Reagan and has no idea what is…",852
"RT @realDonaldTrump: I never called Meghan Markle “nasty.” Made up by the Fake News Media, and they got caught cold! Will @CNN, @nytimes an…",406
Other values (26486),40875

Value,Count,Frequency (%),Unnamed: 3
"RT @realDonaldTrump: Mexico is sending a big delegation to talk about the Border. Problem is, they’ve been “talking” for 25 years. We want…",1994,4.5%,
"RT @realDonaldTrump: Peggy Noonan, the simplistic writer for Trump Haters all, is stuck in the past glory of Reagan and has no idea what is…",852,1.9%,
"RT @realDonaldTrump: I never called Meghan Markle “nasty.” Made up by the Fake News Media, and they got caught cold! Will @CNN, @nytimes an…",406,0.9%,
"RT @jaketapper: “The party told you to reject the evidence of your eyes and ears. It was their final, most essential command.” — Orwell, 1…",386,0.9%,
"RT @CNNSitRoom: CNN's Jim @Acosta: ""Pres. Trump gave fact-checkers quite a workout today as he opened up a fire hose of falsehoods on speci…",379,0.9%,
"RT @realDonaldTrump: BIG NEWS! As I promised two weeks ago, the first shipment of LNG has just left the Cameron LNG Export Facility in Loui…",317,0.7%,
RT @jaketapper: JUST IN: Voicemail transcript from Trump’s lawyer to Flynn’s is released @ShimonPro reports @TheLeadCNN https://t.co/qC0Jjn…,280,0.6%,
RT @HillaryClinton: .@JulietteKayyem makes a chilling point in her @washingtonpost op-ed today about the way shooters' use of barrel suppre…,215,0.5%,
RT @deray: I’m just 15 mins into #WhenTheySeeUs and it’s almost too much. People call us conspiracy theorists when we explain stories like…,207,0.5%,
RT @LawWorksAction: [MUST WATCH] @JudgeNap and @FoxNews anchors all agree: Special Counsel Mueller did NOT exonerate @realDonaldTrump and f…,197,0.4%,

0,1
Distinct count,28606
Unique (%),64.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.8586e+17
Minimum,4999
Maximum,1.1354e+18
Zeros (%),0.0%

0,1
Minimum,4999.0
5-th percentile,22309000.0
Q1,301660000.0
Median,2851100000.0
Q3,8.8392e+17
95-th percentile,1.1086e+18
Maximum,1.1354e+18
Range,1.1354e+18
Interquartile range,8.8392e+17

0,1
Standard deviation,4.6954e+17
Coef of variation,1.2169
Kurtosis,-1.6718
Mean,3.8586e+17
MAD,4.5457e+17
Skewness,0.45742
Sum,1.7027e+22
Variance,2.2047e+35
Memory size,689.5 KiB

Value,Count,Frequency (%),Unnamed: 3
1.045423275642368e+18,65,0.1%,
9.407896091721646e+17,63,0.1%,
2347323266.0,47,0.1%,
2218000892.0,46,0.1%,
2377480808.0,46,0.1%,
3433002136.0,38,0.1%,
16028055.0,36,0.1%,
1.0855459815617126e+18,33,0.1%,
8.266720719000658e+17,32,0.1%,
9.854336894663107e+17,31,0.1%,

Value,Count,Frequency (%),Unnamed: 3
4999.0,1,0.0%,
5357.0,1,0.0%,
12514.0,5,0.0%,
246103.0,2,0.0%,
675883.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1.1353602125573448e+18,1,0.0%,
1.1353611193574196e+18,1,0.0%,
1.1353632286058496e+18,1,0.0%,
1.135364420295635e+18,1,0.0%,
1.1353661123534276e+18,1,0.0%,

0,1
Distinct count,6029
Unique (%),13.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5360
Minimum,0
Maximum,42023000
Zeros (%),1.2%

0,1
Minimum,0
5-th percentile,6
Q1,67
Median,310
Q3,1478
95-th percentile,9818
Maximum,42023000
Range,42023000
Interquartile range,1411

0,1
Standard deviation,285230
Coef of variation,53.214
Kurtosis,21347
Mean,5360
MAD,8378.1
Skewness,145.03
Sum,236520000
Variance,81355000000
Memory size,689.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,510,1.2%,
1.0,405,0.9%,
2.0,375,0.8%,
3.0,310,0.7%,
6.0,271,0.6%,
7.0,271,0.6%,
8.0,270,0.6%,
4.0,262,0.6%,
9.0,261,0.6%,
5.0,253,0.6%,

Value,Count,Frequency (%),Unnamed: 3
0.0,510,1.2%,
1.0,405,0.9%,
2.0,375,0.8%,
3.0,310,0.7%,
4.0,262,0.6%,

Value,Count,Frequency (%),Unnamed: 3
1028503.0,1,0.0%,
1608557.0,1,0.0%,
3535731.0,1,0.0%,
3982803.0,1,0.0%,
42023320.0,2,0.0%,

0,1
Constant value,

0,1
Distinct count,41
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
en,39086
und,3445
es,575
Other values (38),1021

Value,Count,Frequency (%),Unnamed: 3
en,39086,88.6%,
und,3445,7.8%,
es,575,1.3%,
ja,187,0.4%,
in,81,0.2%,
fr,79,0.2%,
pt,74,0.2%,
tl,67,0.2%,
pl,59,0.1%,
de,51,0.1%,

0,1
Constant value,control

0,1
Distinct count,16057
Unique (%),36.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Minimum,2019-05-24 03:23:03
Maximum,2019-06-03 02:09:16

0,1
Distinct count,8558
Unique (%),19.4%
Missing (%),0.0%
Missing (n),0

0,1
,16021
United States,1889
USA,574
Other values (8555),25643

Value,Count,Frequency (%),Unnamed: 3
,16021,36.3%,
United States,1889,4.3%,
USA,574,1.3%,
"California, USA",540,1.2%,
"Florida, USA",464,1.1%,
"Texas, USA",450,1.0%,
"Los Angeles, CA",243,0.6%,
"New York, USA",212,0.5%,
"Washington, DC",203,0.5%,
"New York, NY",180,0.4%,

0,1
Constant value,False

0,1
Constant value,

Unnamed: 0,external_author_id,author,content,region,following,followers,updates,post_type,account_type,retweet,account_category,publish_date,language
0,1.114658e+18,Draftdodgingdon,@realDonaldTrump it’s perfectly reasonable tha...,,,0.0,,control,control,False,control,2019-06-02 18:34:59,en
1,1.038424e+18,beapartofthemo1,RT @BelkissObadia: BREAKING NEWS: \n\n@realDon...,"Bradenton, FL",,1198.0,,control,control,False,control,2019-06-02 18:34:59,en
2,8.798349e+17,BarbHuber9,@realDonaldTrump I thought you were supposed t...,,,1099.0,,control,control,False,control,2019-06-02 18:34:59,en
3,4765020000.0,nonamehombre,RT @realDonaldTrump: Mexico is sending a big d...,,,69.0,,control,control,False,control,2019-06-02 18:34:59,en
4,8.244085e+17,letbuildthewall,RT @realDonaldTrump: Mexico is sending a big d...,,,18.0,,control,control,False,control,2019-06-02 18:34:59,en


In [117]:
df_export['date_published'] = pd.to_datetime(df_export['publish_date'])
df_export.set_index('date_published',inplace=True)

KeyError: 'id'

In [170]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1272847 entries, 2017-10-01 19:58:00 to 2015-08-13 11:19:00
Data columns (total 11 columns):
external_author_id    1272847 non-null float64
author                1272847 non-null object
content               1272847 non-null object
region                1271702 non-null category
following             1272847 non-null int64
followers             1272847 non-null int64
updates               1272847 non-null int64
post_type             0 non-null category
account_type          1272501 non-null category
retweet               1272847 non-null int64
account_category      1272847 non-null category
dtypes: category(4), float64(1), int64(4), object(2)
memory usage: 82.5+ MB


# Moving on to NLP Analysis while dataset is reconsidered

In [None]:
df_data = pd.DataFrame()
for at in query_ats:
    df_data

In [None]:
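# First, get all text from the original df and the new top-40 @'s tweets into one dataframe
# Only content and label are needed; the 'troll'/'control' label values are assumed here
df_data = pd.concat([
    pd.DataFrame({'content': df['content'], 'label': 'troll'}),
    pd.DataFrame({'content': df_export['content'], 'label': 'control'}),
], ignore_index=True)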