# 3 Million Russian Troll Tweets
- James M Irving, Ph.D.
- Mod 4 Project
- Flatiron Full Time Data Science Bootcamp - 02/2019 Cohort

## GOAL: 

- *If I can get a control dataset of non-troll tweets from the same time period with similar hashtags:*
    - Use NLP to predict if a tweet is from an authentic user or a Russian troll.
- *If there are no control tweets to compare to:*
    - Use NLP to predict how many retweets a troll tweet will get.
    - Consider both the raw # of retweets and a normalized # of retweets / # of followers.
        - The latter would give a better indication of the language's effect on propagation.
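
A minimal sketch of the normalized target described above. The `retweet_count` column is hypothetical (the Kaggle files only carry a binary `retweet` flag), and the numbers are toy values. Non-positive follower counts are treated as missing, since `followers` contains values of 0 and -1.

```python
import pandas as pd

# Toy data; `retweet_count` is a hypothetical column, not part of the Kaggle files.
toy = pd.DataFrame({
    "retweet_count": [10, 0, 50, 5],
    "followers": [100, 0, 1000, 50],
})

# Mask non-positive follower counts so the ratio becomes NaN instead of inf.
valid_followers = toy["followers"].where(toy["followers"] > 0)
toy["retweets_per_follower"] = toy["retweet_count"] / valid_followers

print(toy)
```

Rows with a NaN ratio could then be dropped or imputed before modeling, depending on how many there are.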
        

## Dataset Features:
- Kaggle Dataset published by FiveThirtyEight
    - https://www.kaggle.com/fivethirtyeight/russian-troll-tweets/downloads/russian-troll-tweets.zip/2
<br>    
- Data is split into 9 .csv files
    - 'IRAhandle_tweets_1.csv' to 9

- **Variables:**
    - ~~`external_author_id` | An author account ID from Twitter~~
    - `author` | The handle sending the tweet
    - `content` | The text of the tweet
    - `region` | A region classification, as [determined by Social Studio](https://help.salesforce.com/articleView?id=000199367&type=1)
    - `language` | The language of the tweet
    - `publish_date` | The date and time the tweet was sent
    - ~~`harvested_date` | The date and time the tweet was collected by Social Studio~~
    - `following` | The number of accounts the handle was following at the time of the tweet
    - `followers` | The number of followers the handle had at the time of the tweet
    - `updates` | The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
    - `post_type` | Indicates if the tweet was a retweet or a quote-tweet *[What's a quote-tweet?]*
    - `account_type` | Specific account theme, as coded by Linvill and Warren
    - `retweet` | A binary indicator of whether or not the tweet is a retweet [?]
    - `account_category` | General account theme, as coded by Linvill and Warren
    - `new_june_2018` | A binary indicator of whether the handle was newly listed in June 2018
    
### **Classification of account_category**
Taken from: [rcmediafreedom.eu summary](https://www.rcmediafreedom.eu/Publications/Academic-sources/Troll-Factories-The-Internet-Research-Agency-and-State-Sponsored-Agenda-Building)

>- **They identified five categories of IRA-associated Twitter accounts, each with unique patterns of behaviors:**
    - **Right Troll**, spreading nativist and right-leaning populist messages. It supported the candidacy and Presidency of Donald Trump and denigrated the Democratic Party. It often sent divisive messages about mainstream and moderate Republicans.
    - **Left Troll**, sending socially liberal messages and discussing gender, sexual, religious, and -especially- racial identity. Many tweets seemed intentionally divisive, attacking mainstream Democratic politicians, particularly Hillary Clinton, while supporting Bernie Sanders prior to the election.
    - **News Feed**, overwhelmingly presenting themselves as U.S. local news aggregators, linking to legitimate regional news sources and tweeting about issues of local interest.
    - **Hashtag Gamer**, dedicated almost exclusively to playing hashtag games.
    - **Fearmonger**: spreading a hoax about poisoned turkeys near the 2015 Thanksgiving holiday.

>The different types of account were used differently and their efforts were conducted systematically, with different allocation when faced with different political circumstances or shifting goals. E.g.: there was a spike of activity by right and left troll accounts before the publication of John Podesta's emails by WikiLeaks. According to the authors, this activity can be characterised as “industrialized political warfare”.

___

In [38]:
import bs_ds as bs
from bs_ds.imports import *

In [39]:
import os
root_dir = 'russian-troll-tweets/'
# os.listdir('russian-troll-tweets/')
filelist = [os.path.join(root_dir,file) for file in os.listdir(root_dir) if file.endswith('.csv')]
filelist

['russian-troll-tweets/IRAhandle_tweets_1.csv',
 'russian-troll-tweets/IRAhandle_tweets_2.csv',
 'russian-troll-tweets/IRAhandle_tweets_3.csv',
 'russian-troll-tweets/IRAhandle_tweets_4.csv',
 'russian-troll-tweets/IRAhandle_tweets_5.csv',
 'russian-troll-tweets/IRAhandle_tweets_6.csv',
 'russian-troll-tweets/IRAhandle_tweets_7.csv',
 'russian-troll-tweets/IRAhandle_tweets_8.csv',
 'russian-troll-tweets/IRAhandle_tweets_9.csv']

In [40]:
# Previewing dataset
df = pd.read_csv(filelist[0])
df.head(3)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,post_type,account_type,new_june_2018,retweet,account_category
0,9.06e+17,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,10/1/2017 19:58,10/1/2017 19:59,1052,9636,253,,Right,0,0,RightTroll
1,9.06e+17,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,10/1/2017 22:43,10/1/2017 22:43,1054,9637,254,,Right,0,0,RightTroll
2,9.06e+17,10_GOP,Daughter of fallen Navy Sailor delivers powerf...,Unknown,English,10/1/2017 22:50,10/1/2017 22:51,1054,9637,255,RETWEET,Right,0,1,RightTroll


## Merging full dataset

In [43]:
# Vertically concatenate all CSV files (each file keeps its own row index)
df = pd.concat([pd.read_csv(file) for file in filelist], axis=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2973371 entries, 0 to 37554
Data columns (total 15 columns):
external_author_id    float64
author                object
content               object
region                object
language              object
publish_date          object
harvested_date        object
following             int64
followers             int64
updates               int64
post_type             object
account_type          object
new_june_2018         int64
retweet               int64
account_category      object
dtypes: float64(1), int64(5), object(9)
memory usage: 363.0+ MB


In [44]:
df.head(2)

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,post_type,account_type,new_june_2018,retweet,account_category
0,9.06e+17,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,10/1/2017 19:58,10/1/2017 19:59,1052,9636,253,,Right,0,0,RightTroll
1,9.06e+17,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,10/1/2017 22:43,10/1/2017 22:43,1054,9637,254,,Right,0,0,RightTroll


# SCRUB / EDA

In [45]:
from pandas_profiling import ProfileReport
ProfileReport(df)

0,1
Number of variables,16
Number of observations,2973371
Total Missing (%),3.5%
Total size in memory,363.0 MiB
Average record size in memory,128.0 B

0,1
Numeric,5
Categorical,9
Boolean,2
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

0,1
Distinct count,8
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
NonEnglish,837725
RightTroll,719087
NewsFeed,599294
Other values (5),817265

Value,Count,Frequency (%),Unnamed: 3
NonEnglish,837725,28.2%,
RightTroll,719087,24.2%,
NewsFeed,599294,20.2%,
LeftTroll,427811,14.4%,
HashtagGamer,241827,8.1%,
Commercial,122582,4.1%,
Unknown,13905,0.5%,
Fearmonger,11140,0.4%,

0,1
Distinct count,21
Unique (%),0.0%
Missing (%),0.0%
Missing (n),363

0,1
Russian,721191
Right,718619
local,460197
Other values (17),1073001

Value,Count,Frequency (%),Unnamed: 3
Russian,721191,24.3%,
Right,718619,24.2%,
local,460197,15.5%,
left,427811,14.4%,
Hashtager,241827,8.1%,
news,139097,4.7%,
Commercial,122582,4.1%,
German,91851,3.1%,
Italian,15899,0.5%,
?,13542,0.5%,

0,1
Distinct count,2848
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
EXQUOTE,59652
SCREAMYMONKEY,44041
WORLDNEWSPOLI,36974
Other values (2845),2832704

Value,Count,Frequency (%),Unnamed: 3
EXQUOTE,59652,2.0%,
SCREAMYMONKEY,44041,1.5%,
WORLDNEWSPOLI,36974,1.2%,
AMELIEBALDWIN,35371,1.2%,
TODAYPITTSBURGH,33602,1.1%,
SPECIALAFFAIR,32588,1.1%,
SEATTLE_POST,30800,1.0%,
FINDDIET,29038,1.0%,
KANSASDAILYNEWS,28890,1.0%,
ROOMOFRUMOR,28360,1.0%,

0,1
Distinct count,2365943
Unique (%),79.6%
Missing (%),0.0%
Missing (n),1

0,1
В городе Сочи. Олимпиада – праздник или стихийное...,670
Лондон 2012 — Олимпиада Антихриста,227
NewsOne Now Audio Podcast: Bishop E.W. Jackson Calls #BlackLivesMatter Is Movement “Disgraceful”,217
Other values (2365939),2972256

Value,Count,Frequency (%),Unnamed: 3
В городе Сочи. Олимпиада – праздник или стихийное...,670,0.0%,
Лондон 2012 — Олимпиада Антихриста,227,0.0%,
NewsOne Now Audio Podcast: Bishop E.W. Jackson Calls #BlackLivesMatter Is Movement “Disgraceful”,217,0.0%,
"...стадион, У нас своя олимпиада – За малышом бросок под стол...",213,0.0%,
Захарченко: ОБСЕ игнорирует гибель мирных жителей http://t.co/bvDdvi71yc http://t.co/7Wyyi5sYdh,197,0.0%,
"Вот в такую ситуацию можно попасть, если заказать первое попавшееся такси http://t.co/lLiWML7UCi http://t.co/EGzPOn15mO",144,0.0%,
"Олимпиада 2014. Андрей Малахов, Дмитрий Борисов и Кирилл",141,0.0%,
Honor scores #sports,130,0.0%,
TV/radio schedule #sports,119,0.0%,
SE Wis. road construction projects #Wisconsin,114,0.0%,

0,1
Distinct count,2490
Unique (%),0.1%
Missing (%),0.0%
Missing (n),4
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.2961e+17
Minimum,34976000
Maximum,9.8125e+17
Zeros (%),0.0%

0,1
Minimum,34976000.0
5-th percentile,1647500000.0
Q1,1930700000.0
Median,2581800000.0
Q3,3254300000.0
95-th percentile,8.92e+17
Maximum,9.8125e+17
Range,9.8125e+17
Interquartile range,1323500000.0

0,1
Standard deviation,3.0363e+17
Coef of variation,2.3426
Kurtosis,1.8042
Mean,1.2961e+17
MAD,2.19e+17
Skewness,1.9366
Sum,3.8539e+23
Variance,9.2194e+34
Memory size,22.7 MiB

Value,Count,Frequency (%),Unnamed: 3
8.92e+17,65401,2.2%,
3272640600.0,45985,1.5%,
8.98e+17,45108,1.5%,
2943515140.0,44041,1.5%,
7.89e+17,43395,1.5%,
8.95e+17,38644,1.3%,
1679279490.0,35371,1.2%,
2601235821.0,33602,1.1%,
2951556370.0,32588,1.1%,
8.91e+17,31726,1.1%,

Value,Count,Frequency (%),Unnamed: 3
34976398.0,32,0.0%,
72581988.0,40,0.0%,
87588938.0,1993,0.1%,
97335028.0,232,0.0%,
131812518.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
9.44543e+17,1,0.0%,
9.44573e+17,1,0.0%,
9.44658e+17,35,0.0%,
9.68131e+17,1,0.0%,
9.81251e+17,96,0.0%,

0,1
Distinct count,66363
Unique (%),2.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7018.9
Minimum,-1
Maximum,251276
Zeros (%),0.5%

0,1
Minimum,-1
5-th percentile,50
Q1,320
Median,1274
Q3,10600
95-th percentile,26669
Maximum,251276
Range,251277
Interquartile range,10280

0,1
Standard deviation,14585
Coef of variation,2.0779
Kurtosis,58.41
Mean,7018.9
MAD,8538.4
Skewness,5.9359
Sum,20869833400
Variance,212710000
Memory size,22.7 MiB

Value,Count,Frequency (%),Unnamed: 3
0,13973,0.5%,
2,11148,0.4%,
1,9726,0.3%,
4,7525,0.3%,
5,7170,0.2%,
7,6167,0.2%,
3,5917,0.2%,
6,5897,0.2%,
8,5080,0.2%,
83,4555,0.2%,

Value,Count,Frequency (%),Unnamed: 3
-1,3,0.0%,
0,13973,0.5%,
1,9726,0.3%,
2,11148,0.4%,
3,5917,0.2%,

Value,Count,Frequency (%),Unnamed: 3
251265,1,0.0%,
251266,1,0.0%,
251267,1,0.0%,
251275,2,0.0%,
251276,1,0.0%,

0,1
Distinct count,28224
Unique (%),0.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3433.5
Minimum,-1
Maximum,76210
Zeros (%),2.0%

0,1
Minimum,-1
5-th percentile,3
Q1,327
Median,1499
Q3,4730
95-th percentile,13039
Maximum,76210
Range,76211
Interquartile range,4403

0,1
Standard deviation,5609.9
Coef of variation,1.6339
Kurtosis,48.907
Mean,3433.5
MAD,3433.1
Skewness,5.2837
Sum,10209139778
Variance,31471000
Memory size,22.7 MiB

Value,Count,Frequency (%),Unnamed: 3
2,61372,2.1%,
0,58730,2.0%,
3,38158,1.3%,
5,21242,0.7%,
68,12119,0.4%,
74,9948,0.3%,
65,7272,0.2%,
7,6816,0.2%,
120,5852,0.2%,
247,5719,0.2%,

Value,Count,Frequency (%),Unnamed: 3
-1,3,0.0%,
0,58730,2.0%,
1,4719,0.2%,
2,61372,2.1%,
3,38158,1.3%,

Value,Count,Frequency (%),Unnamed: 3
76206,13,0.0%,
76207,3,0.0%,
76208,2,0.0%,
76209,4,0.0%,
76210,4,0.0%,

0,1
Distinct count,906316
Unique (%),30.5%
Missing (%),0.0%
Missing (n),0

0,1
3/22/2016 17:35,1333
12/29/2016 4:01,455
3/22/2016 17:34,232
Other values (906313),2971351

Value,Count,Frequency (%),Unnamed: 3
3/22/2016 17:35,1333,0.0%,
12/29/2016 4:01,455,0.0%,
3/22/2016 17:34,232,0.0%,
6/20/2018 4:03,224,0.0%,
8/16/2017 1:32,216,0.0%,
8/16/2017 1:29,200,0.0%,
10/30/2016 14:57,178,0.0%,
8/15/2017 17:11,149,0.0%,
12/29/2016 4:30,144,0.0%,
8/16/2017 1:30,144,0.0%,

0,1
Distinct count,397232
Unique (%),13.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,182210
Minimum,0
Maximum,397231
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,16518
Q1,88223
Median,181140
Q3,274060
95-th percentile,355140
Maximum,397231
Range,397231
Interquartile range,185840

0,1
Standard deviation,108200
Coef of variation,0.59384
Kurtosis,-1.1622
Mean,182210
MAD,93452
Skewness,0.05064
Sum,541765416541
Variance,11708000000
Memory size,22.7 MiB

Value,Count,Frequency (%),Unnamed: 3
9,2047,0.1%,
9,7377,0.2%,
9,27871,0.9%,
9,25822,0.9%,
9,31965,1.1%,
9,29916,1.0%,
9,19675,0.7%,
9,17626,0.6%,
9,23769,0.8%,
9,21720,0.7%,

Value,Count,Frequency (%),Unnamed: 3
1,390466,13.1%,
1,393299,13.2%,
1,394660,13.3%,
1,396709,13.3%,
1,391016,13.2%,

Value,Count,Frequency (%),Unnamed: 3
9,30841,1.0%,
9,24698,0.8%,
9,26747,0.9%,
9,12404,0.4%,
9,2047,0.1%,

0,1
Distinct count,56
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
English,2128963
Russian,624124
German,87171
Other values (53),133113

Value,Count,Frequency (%),Unnamed: 3
English,2128963,71.6%,
Russian,624124,21.0%,
German,87171,2.9%,
Ukrainian,39361,1.3%,
Italian,18254,0.6%,
Serbian,9615,0.3%,
Uzbek,9491,0.3%,
Bulgarian,9458,0.3%,
LANGUAGE UNDEFINED,8325,0.3%,
Arabic,7595,0.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.20787

0,1
0,2355286
1,618085

Value,Count,Frequency (%),Unnamed: 3
0,2355286,79.2%,
1,618085,20.8%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),55.9%
Missing (n),1662425

0,1
RETWEET,1270702
QUOTE_TWEET,40244
(Missing),1662425

Value,Count,Frequency (%),Unnamed: 3
RETWEET,1270702,42.7%,
QUOTE_TWEET,40244,1.4%,
(Missing),1662425,55.9%,

0,1
Distinct count,896684
Unique (%),30.2%
Missing (%),0.0%
Missing (n),0

0,1
8/16/2017 1:29,202
8/16/2017 1:31,186
8/15/2017 17:01,149
Other values (896681),2972834

Value,Count,Frequency (%),Unnamed: 3
8/16/2017 1:29,202,0.0%,
8/16/2017 1:31,186,0.0%,
8/15/2017 17:01,149,0.0%,
8/16/2017 1:30,146,0.0%,
8/12/2017 19:11,144,0.0%,
8/16/2017 1:32,144,0.0%,
8/15/2017 17:09,143,0.0%,
8/15/2017 17:10,133,0.0%,
8/15/2017 17:11,128,0.0%,
8/16/2017 1:28,128,0.0%,

0,1
Distinct count,37
Unique (%),0.0%
Missing (%),0.3%
Missing (n),8843

0,1
United States,2055882
Unknown,572767
Azerbaijan,100755
Other values (33),235124

Value,Count,Frequency (%),Unnamed: 3
United States,2055882,69.1%,
Unknown,572767,19.3%,
Azerbaijan,100755,3.4%,
United Arab Emirates,74908,2.5%,
Russian Federation,37637,1.3%,
Belarus,29619,1.0%,
Germany,27192,0.9%,
United Kingdom,18062,0.6%,
Italy,13494,0.5%,
Iraq,11219,0.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.4409

0,1
0,1662425
1,1310946

Value,Count,Frequency (%),Unnamed: 3
0,1662425,55.9%,
1,1310946,44.1%,

0,1
Distinct count,97696
Unique (%),3.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,10498
Minimum,-1
Maximum,166113
Zeros (%),0.0%

0,1
Minimum,-1
5-th percentile,308
Q1,1787
Median,4333
Q3,12341
95-th percentile,39104
Maximum,166113
Range,166114
Interquartile range,10554

0,1
Standard deviation,17687
Coef of variation,1.6849
Kurtosis,33.376
Mean,10498
MAD,10215
Skewness,4.9007
Sum,31213128677
Variance,312840000
Memory size,22.7 MiB

Value,Count,Frequency (%),Unnamed: 3
3,765,0.0%,
5,739,0.0%,
6,734,0.0%,
4,729,0.0%,
7,715,0.0%,
8,700,0.0%,
9,694,0.0%,
2,688,0.0%,
11,663,0.0%,
10,662,0.0%,

Value,Count,Frequency (%),Unnamed: 3
-1,3,0.0%,
1,643,0.0%,
2,688,0.0%,
3,765,0.0%,
4,729,0.0%,

Value,Count,Frequency (%),Unnamed: 3
166109,1,0.0%,
166110,1,0.0%,
166111,1,0.0%,
166112,1,0.0%,
166113,1,0.0%,

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,post_type,account_type,new_june_2018,retweet,account_category
0,9.06e+17,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,10/1/2017 19:58,10/1/2017 19:59,1052,9636,253,,Right,0,0,RightTroll
1,9.06e+17,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,10/1/2017 22:43,10/1/2017 22:43,1054,9637,254,,Right,0,0,RightTroll
2,9.06e+17,10_GOP,Daughter of fallen Navy Sailor delivers powerf...,Unknown,English,10/1/2017 22:50,10/1/2017 22:51,1054,9637,255,RETWEET,Right,0,1,RightTroll
3,9.06e+17,10_GOP,JUST IN: President Trump dedicates Presidents ...,Unknown,English,10/1/2017 23:52,10/1/2017 23:52,1062,9642,256,,Right,0,0,RightTroll
4,9.06e+17,10_GOP,"19,000 RESPECTING our National Anthem! #StandF...",Unknown,English,10/1/2017 2:13,10/1/2017 2:13,1050,9645,246,RETWEET,Right,0,1,RightTroll


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2973371 entries, 0 to 37554
Data columns (total 15 columns):
external_author_id    float64
author                object
content               object
region                object
language              object
publish_date          object
harvested_date        object
following             int64
followers             int64
updates               int64
post_type             object
account_type          object
new_june_2018         int64
retweet               int64
account_category      object
dtypes: float64(1), int64(5), object(9)
memory usage: 363.0+ MB


## Observations from Inspection / Pandas_Profiling ProfileReport

- **Language to analyze is in `content`:**
    - Actual tweet contents. 
 
- **Classification/Analysis Thoughts:**
    - **Variables should be considered in 2 ways:**
        - First, the tweet contents. 
            - Use NLP to engineer features to feed into deep learning.
                - Sentiment analysis, named-entity frequency/types, most-similar words. 
        - Second, the tweet metadata. 
        
### Thoughts on specific features:
- `language`
    - There are 56 unique languages. 
    - About 2.1 million tweets are in English, about 624 K are in Russian, etc.
    - Note: for metadata, checking whether an account posts in more than 1 language may be a good predictor. 
- `followers`/`following`
    - **following** could be informative if the goal is to predict whether it's a troll tweet.
    - **followers** should be used (with retweets) if predicting retweets based on content. 
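
The multi-language idea noted for `language` could be turned into a per-account feature roughly as follows. This is a sketch on toy data; the feature name `posts_multiple_languages` is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the `author` and `language` columns of the full dataset.
toy = pd.DataFrame({
    "author": ["A", "A", "B", "B", "C"],
    "language": ["English", "Russian", "English", "English", "German"],
})

# Count distinct languages per account, then flag accounts with more than one.
langs_per_author = toy.groupby("author")["language"].nunique()
multi = (langs_per_author > 1).rename("posts_multiple_languages")

# Broadcast the per-account flag back onto every tweet row.
toy = toy.join(multi, on="author")
print(toy)
```

A feature like this has to be computed before the non-English rows are dropped, otherwise every account would appear to post in a single language.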

### Questions to answer:
- [ ] Why are so many `post_type` values missing (about 56%)?
- [ ] How many tweets were written by a Russian troll account?
    
### Scrubbing to Perform
- **Recast Columns:**
    - [ ] `publish_date` to datetime. 
- **Columns to Discard:**
    - [ ] `harvested_date` (we care about publish_date, if anything, time-wise)
    - [ ] `language`: remove all non-English tweets and drop column
    - [ ] `new_june_2018`

In [47]:
# Drop non-English rows; copy so later in-place edits do not act on a view of the original
df = df.loc[df.language=='English'].copy()
# df.info()

In [48]:
# Columns that are not needed for the analysis
cols_to_drop = ['harvested_date','new_june_2018']
df.drop(columns=cols_to_drop, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2128963 entries, 0 to 37431
Data columns (total 13 columns):
external_author_id    float64
author                object
content               object
region                object
language              object
publish_date          object
following             int64
followers             int64
updates               int64
post_type             object
account_type          object
retweet               int64
account_category      object
dtypes: float64(1), int64(4), object(8)
memory usage: 227.4+ MB


___
# Save/Load and Resume

In [None]:
save_or_load = input('Would you like to "save" or "load" dataframe?\n("save","load","no"):')

if save_or_load.lower()=='save':
    # Save csv
    df.to_csv('russian_troll_tweets_eng_only_date_pub_index.csv')
    
if save_or_load.lower()=='load':
    import bs_ds as bs
    from bs_ds.imports import *
    # Load csv
    df = pd.read_csv('russian_troll_tweets_eng_only_date_pub_index.csv')    

### Recasting Publish date as datetime column (date_published)

In [None]:
# Recast publish_date as datetime (date_published) and make it the index
df['date_published'] = pd.to_datetime(df['publish_date'])
df.set_index('date_published', inplace=True)
print('Changed index to datetime "date_published".')

In [None]:
print(df.columns)

In [10]:
# date_published is now the index, so read the date range from the index
print(df.index.max(), df.index.min())

2018-05-30 20:58:00 2012-02-06 20:24:00


In [17]:
df.head()

Unnamed: 0_level_0,author,content,region,language,publish_date,following,followers,updates,post_type,account_type,retweet,account_category
date_published,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2017-10-01 19:58:00,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,English,10/1/2017 19:58,1052,9636,253,,Right,0,RightTroll
2017-10-01 22:43:00,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,English,10/1/2017 22:43,1054,9637,254,,Right,0,RightTroll
2017-10-01 22:50:00,10_GOP,Daughter of fallen Navy Sailor delivers powerf...,Unknown,English,10/1/2017 22:50,1054,9637,255,RETWEET,Right,1,RightTroll
2017-10-01 23:52:00,10_GOP,JUST IN: President Trump dedicates Presidents ...,Unknown,English,10/1/2017 23:52,1062,9642,256,,Right,0,RightTroll
2017-10-01 02:13:00,10_GOP,"19,000 RESPECTING our National Anthem! #StandF...",Unknown,English,10/1/2017 2:13,1050,9645,246,RETWEET,Right,1,RightTroll


In [18]:
# Drop un-needed columns
cols_to_drop = ['publish_date','language']
for col in cols_to_drop:
    df.drop(col, axis=1, inplace=True)
    print(f'Dropped {col}.')

# Recast categorical columns
cols_to_cats = ['region','post_type','account_type','account_category']
for col in cols_to_cats:
    df[col] = df[col].astype('category')
    print(f'Converted {col} to category.')

# Drop problematic nan in 'content'
df.dropna(subset=['content'],inplace=True) # Dropping the 1 null value 

df.head()

Dropped publish_date.
Dropped language.
Converted region to category.
Converted post_type to category.
Converted account_type to category.
Converted account_category to category.


Unnamed: 0_level_0,author,content,region,following,followers,updates,post_type,account_type,retweet,account_category
date_published,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2017-10-01 19:58:00,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,1052,9636,253,,Right,0,RightTroll
2017-10-01 22:43:00,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,1054,9637,254,,Right,0,RightTroll
2017-10-01 22:50:00,10_GOP,Daughter of fallen Navy Sailor delivers powerf...,Unknown,1054,9637,255,RETWEET,Right,1,RightTroll
2017-10-01 23:52:00,10_GOP,JUST IN: President Trump dedicates Presidents ...,Unknown,1062,9642,256,,Right,0,RightTroll
2017-10-01 02:13:00,10_GOP,"19,000 RESPECTING our National Anthem! #StandF...",Unknown,1050,9645,246,RETWEET,Right,1,RightTroll


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2420533 entries, 2017-10-01 19:58:00 to 2015-08-13 11:19:00
Data columns (total 10 columns):
author              object
content             object
region              category
following           int64
followers           int64
updates             int64
post_type           category
account_type        category
retweet             int64
account_category    category
dtypes: category(4), int64(4), object(2)
memory usage: 138.5+ MB


# Thoughts on My Search Strategy

**My Twitter API Link:**<br>
https://api.twitter.com/1.1/tweets/search/fullarchive/search.json


**Inspect Data to get search parameters:**
- [X] Get the date range for the English tweets in the original dataset<br>
    - **Tweet date range:**
        - **2012-02-06** to **2018-05-30**

- [X] Get a list of the hashtags (and their frequencies) from the dataframe

**Determine the most feasible and balanced way of extracting control tweets**
- [ ] How many of each tag / @ should I try to extract?
- [ ] What limitations of the API will be a roadblock to getting as many tweets as desired?
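
One possible answer to the "how many of each" question is to split a fixed download budget across search terms in proportion to how often each term appears in the troll tweets. The sketch below assumes that approach; `allocate_quota` is a hypothetical helper and the counts are toy values, not figures from the dataset.

```python
import pandas as pd

def allocate_quota(counts, total):
    """Split `total` across terms in proportion to their counts (at least 1 each)."""
    shares = counts / counts.sum()
    return (shares * total).round().astype(int).clip(lower=1)

# Toy frequencies for three search terms (illustrative numbers only).
toy_counts = pd.Series({"#news": 600, "#sports": 300, "#MAGA": 100})
quota = allocate_quota(toy_counts, total=1000)
print(quota)
```

Because of rounding and the minimum of 1 per term, the quotas may not sum exactly to `total` for real frequency tables.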

In [20]:
# Inspect Data to get search parameters:
print(f'Tweet date range:\n {min(df.index)} to {max(df.index)}')
print(f'\nTotal days: {max(df.index)-min(df.index)}')

Tweet date range:
 2012-02-06 20:24:00 to 2018-05-30 20:58:00

Total days: 2305 days 00:34:00
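
The date range can be turned into request parameters for a full-archive search. This sketch assumes the endpoint accepts `fromDate` and `toDate` as `YYYYMMDDHHmm` strings, which should be confirmed against the premium search reference before use.

```python
import pandas as pd

# Date range of the English troll tweets, as printed above.
start = pd.Timestamp("2012-02-06 20:24:00")
end = pd.Timestamp("2018-05-30 20:58:00")

# Assumed parameter names and timestamp format for the full-archive endpoint.
DATE_FMT = "%Y%m%d%H%M"
date_params = {
    "fromDate": start.strftime(DATE_FMT),
    "toDate": end.strftime(DATE_FMT),
}
print(date_params)
```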


## Determining Hashtags & @'s to search for

- Use regular expressions to extract the hashtags #words and @handles.
- Use the top X tags as search terms for the Twitter API
    - There are _1,678,170 hashtag matches_ and _1,165,744 @ matches_ in total (occurrences, not unique values)
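
A small worked example of the regex extraction on a made-up tweet. It also shows one thing to watch for: a pattern of the form `#\w*` matches a bare `#` or `@`, while `#\w+` requires at least one word character after the symbol.

```python
import re

# Made-up tweet text containing a bare '@' and a bare '#'.
sample = "Go #MAGA! cc @realDonaldTrump @ # #tcot"

tags_star = re.findall(r"#\w*", sample)   # zero or more word characters
tags_plus = re.findall(r"#\w+", sample)   # one or more word characters
ats_plus = re.findall(r"@\w+", sample)

print(tags_star)  # ['#MAGA', '#', '#tcot']
print(tags_plus)  # ['#MAGA', '#tcot']
print(ats_plus)   # ['@realDonaldTrump']
```

Switching to the `+` form would change the match totals slightly, so any counts computed with the `*` form would need to be regenerated.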

### def get_tags_ats

In [21]:
# Define get_tags_ats to accept a list of text entries and return all found tags and ats as 2 series/lists
def get_tags_ats(text_to_search, exp_tag=r'(#\w*)', exp_at=r'(@\w*)', output='series', show_counts=False):
    """Accepts a list of text entries to search, a regex for tags, and a regex for @'s.
    Joins all entries in the list of text and then runs re.findall() for both expressions.
    Returns a series (or list) of found_tags and a series (or list) of found_ats."""
    import re
    import pandas as pd

    # Counts rely on Series.value_counts(), so they require series output
    if output.lower() != 'series' and show_counts:
        raise Exception('output must be set to "series" in order to show_counts')

    # Create a single long string from all text entries
    text_to_search_combined = ' '.join(text_to_search)

    # Find every match for each expression
    found_tags = re.findall(exp_tag, text_to_search_combined)
    found_ats = re.findall(exp_at, text_to_search_combined)

    if output.lower() == 'series':
        found_tags = pd.Series(found_tags, name='tags')
        found_ats = pd.Series(found_ats, name='ats')

        if show_counts:
            # Use the local results, not the notebook-level tweet_tags / tweet_ats
            print(f'\t{found_tags.name}:\n{found_tags.value_counts()} \n\n\t{found_ats.name}:\n{found_ats.value_counts()}')

    return found_tags, found_ats

In [22]:
# Build the list of tweet texts to search for hashtags and @'s
text_to_search_list = df['content'].tolist()

text_to_search_list[:2]

['"We have a sitting Democrat US Senator on trial for corruption and you\'ve barely heard a peep from the mainstream media." ~ @nedryun https://t.co/gh6g0D1oiC',
 'Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ']

In [23]:
# Get all tweet tags and @'s from text_to_search_list
tweet_tags, tweet_ats = get_tags_ats(text_to_search_list, show_counts=False)

print(f"There were {len(tweet_tags)} hashtags and {len(tweet_ats)} @'s in total (not unique)\n")

# Create a dataframe with the top 40 tags
df_top_tags = pd.DataFrame(tweet_tags.value_counts()[:40])
df_top_tags['% Total'] = (df_top_tags['tags']/len(tweet_tags)*100)

# Create a dataframe with the top 40 @'s
df_top_ats = pd.DataFrame(tweet_ats.value_counts()[:40])
df_top_ats['% Total'] = (df_top_ats['ats']/len(tweet_ats)*100)

# Display top tags and ats
# bs.display_side_by_side(df_top_tags,df_top_ats)

There were 1678170 hashtags and 1165744 @'s in total (not unique)



### Notes on Top Tags and Ats:


In [24]:
# Choose list of top tags to use in search
list_top_30_tags = df_top_tags.index[:30]
list_top_30_tags

Index(['#news', '#sports', '#politics', '#world', '#local', '#MAGA',
       '#BlackLivesMatter', '#TopNews', '#tcot', '#PJNET', '#health',
       '#business', '#tech', '#entertainment', '#top', '#Cleveland', '#crime',
       '#TopVideo', '#Trump', '#NowPlaying', '#amb', '#environment', '#ISIS',
       '#breaking', '#mar', '#WakeUpAmerica', '#Miami', '#2A', '#GOPDebate',
       '#topl'],
      dtype='object')

In [25]:
# Choose list of top @'s to use in search
list_top_30_ats = df_top_ats.index[:30]
list_top_30_ats

Index(['@realDonaldTrump', '@midnight', '@POTUS', '@HillaryClinton',
       '@YouTube', '@', '@CNN', '@FoxNews', '@TalibKweli', '@WarfareWW',
       '@GiselleEvns', '@WorldOfHashtags', '@deray', '@nytimes', '@josephjett',
       '@CNNPolitics', '@GOP', '@seanhannity', '@BreitbartNews',
       '@BarackObama', '@HashtagRoundup', '@tedcruz', '@washingtonpost',
       '@docrocktex26', '@ShaunKing', '@BernieSanders', '@VanJones68',
       '@mashable', '@Jenn_Abrams', '@SpeakerRyan'],
      dtype='object')

- The most common tags include some very generic categories that may not be helpful in extracting control tweets.
    - ~~Exclude: '#news','#sports','#politics','#world','#local','#TopNews','#health','#business','#tech',~~
    - On second thought, this is entirely appropriate, since these tags would be what appears in the wild.
    - Additionally, using a larger number of them (like 30) starts to provide more targeted hashtags.<br><br>
  
- **The most common @'s are much more revealing and helpful in narrowing the focus of the results.**

___

# Using the Twitter Search API to Extract Control Tweets

- [x] Required API keys are saved in the parent folder of this repo. 
- [x] Check the [Premium account docs for search syntax](https://developer.twitter.com/en/docs/tweets/search/guides/premium-operators.html)
- [x] [Check this article for using Tweepy for most efficient twitter api extraction](https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)

**LINK TO PREMIUM SEARCH API GUIDE**<br>
https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search

**Available search operators**
- Premium search API supports rules with up to 1,024 characters. The Search Tweets APIs support the premium operators listed below. See our Premium operators guide for more details.

- The base URI for the premium search API is https://api.twitter.com/1.1/tweets/search/.

**Matching on Tweet contents:**
- keyword , "quoted phrase" , # , @, url , lang
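
A sketch of how a rule could be assembled from a list of hashtags and @'s using these operators. It assumes terms are combined with `OR` inside parentheses and restricted with `lang:`; `build_query` is a hypothetical helper, and the length check mirrors the 1,024-character limit quoted above.

```python
MAX_RULE_CHARS = 1024  # rule length limit quoted for the premium search API

def build_query(terms, lang="en"):
    """Join search terms with OR and restrict results to one language."""
    joined = " OR ".join(terms)
    query = f"({joined}) lang:{lang}"
    if len(query) > MAX_RULE_CHARS:
        raise ValueError(f"Query is {len(query)} characters; limit is {MAX_RULE_CHARS}")
    return query

query = build_query(["#MAGA", "#BlackLivesMatter", "@CNN"])
print(query)  # (#MAGA OR #BlackLivesMatter OR @CNN) lang:en
```

Splitting a long term list into several shorter rules would be one way to stay under the limit.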


## Using tweepy to access twitter API

- [Helpful tutorial on _most efficient_ way to access twitter API](https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)

### def connect_twitter_api, def search_twitter_api

In [26]:
# Initialize Tweepy with Authorization Keys    
def connect_twitter_api(api_key, api_secret_key):
    import tweepy, sys
    auth = tweepy.AppAuthHandler(api_key, api_secret_key)
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    if (not api):
        print("Can't authenticate.")
        sys.exit(-1)
    return api

In [27]:
def search_twitter_api(api_object, searchQuery, maxTweets, fName, tweetsPerQry=100, max_id=0, sinceId=None):
    """Take an authenticated tweepy api_object, a search queary, max# of tweets to retreive, a desintation filename.
    Uses tweept.api.search for the searchQuery until maxTweets is reached, saved harvest tweets to fName."""
    import sys, jsonpickle, os
    api = api_object
    tweetCount = 0
    print(f'Downloading max{maxTweets} for {searchQuery}...')
    with open(fName, 'a+') as f:
        while tweetCount < maxTweets:

            try:
                # Build the query arguments once; add the paging bounds only when they are set
                search_kwargs = dict(q=searchQuery, count=tweetsPerQry, tweet_mode='extended')
                if max_id > 0:
                    search_kwargs['max_id'] = str(max_id - 1)
                if sinceId:
                    search_kwargs['since_id'] = sinceId
                new_tweets = api.search(**search_kwargs)

                if not new_tweets:
                    print('No more tweets found')
                    break

                for tweet in new_tweets:
                    f.write(jsonpickle.encode(tweet._json, unpicklable=False)+'\n')

                tweetCount+=len(new_tweets)

                print("Downloaded {0} tweets".format(tweetCount))
                max_id = new_tweets[-1].id

            except tweepy.TweepError as e:
                # Just exit if any error
                print("some error : " + str(e))
                break
    print ("Downloaded {0} tweets, Saved to {1}\n".format(tweetCount, fName))
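
The paging logic in `search_twitter_api` relies on tweet IDs decreasing as the search moves back in time: each request asks only for tweets with an ID below the smallest one seen so far. A minimal offline sketch of that idea follows, using a hypothetical `fake_search` stand-in rather than the real tweepy client:

```python
def fake_search(all_ids, count, max_id=None):
    """Hypothetical stand-in for a search endpoint: newest-first IDs no greater than max_id."""
    eligible = [i for i in sorted(all_ids, reverse=True) if max_id is None or i <= max_id]
    return eligible[:count]


def page_through(all_ids, max_tweets, per_query=3):
    """Collect IDs page by page, moving max_id just below the oldest ID seen so far."""
    collected = []
    max_id = None
    while len(collected) < max_tweets:
        page = fake_search(all_ids, per_query, max_id)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        max_id = page[-1] - 1  # the next page must be strictly older
    return collected


harvested = page_through(all_ids=range(100, 110), max_tweets=7)
print(harvested)
```

Because the count is only checked between pages, both this sketch and the loop it mirrors can overshoot the requested maximum by up to one page.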



## Connect to Twitter and Harvest Tweets

### Making lists of tags and ats to query

In [28]:
# df_top_ats.ats[:20], df_top_tags.tags[:20]

In [29]:
# Figure out the # of each @ and each # that I want to query, then make a query_dict to feed into the cell below
query_ats = tuple(zip(df_top_ats.index, df_top_ats['ats']))
query_tags = tuple(zip(df_top_tags.index, df_top_tags['tags']))

# Calculate how many tweets are represented by the top 30 tags and top 30 @'s 
sum_top_tweet_tags = df_top_tags['tags'].sum()
sum_top_tweet_ats = df_top_ats['ats'].sum()
print(f"Sum of top tags = {sum_top_tweet_tags}\nSum of top @'s = {sum_top_tweet_ats}")

Sum of top tags = 525668
Sum of top @'s = 80782
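
If the full counts are too large to harvest, one option is to scale them down to a fixed budget while keeping the same relative proportions. The sketch below assumes a hypothetical helper `scale_quotas` and an arbitrary budget; only the two counts (those printed for `#news` and `@realDonaldTrump` in this notebook's query output) come from the data.

```python
def scale_quotas(counts, budget):
    """Scale per-query tweet counts to a total budget, preserving proportions (rounded down)."""
    total = sum(counts.values())
    return {query: (n * budget) // total for query, n in counts.items()}


# Counts as printed in the notebook's query output; the budget is an arbitrary illustration
troll_counts = {"#news": 130268, "@realDonaldTrump": 14999}
quotas = scale_quotas(troll_counts, budget=10000)
print(quotas)
```

Rounding down keeps the total at or below the budget, at the cost of leaving a few tweets unallocated.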


### Connecting to twitter api and searching for lists of queries

In [30]:
# Import API keys from text files (so not displayed here and not in repo)
# strip() removes the trailing newline that most editors append to the file
with open('../consumer_API_key.txt','r') as f:
    api_key = f.read().strip()
with open('../consumer_API_secret_key.txt','r') as f:
    api_secret_key = f.read().strip()

In [31]:
api = connect_twitter_api(api_key,api_secret_key)

In [32]:
# Extract tweets for top @'s, while matching the distribution of top @'s

final_query_list = query_ats[:3]
filename = 'tweets_for_top3_ats.txt'

for q in final_query_list:
    searchQuery = q[0]
    maxTweets = q[1]
    print(f'Query={searchQuery}, max={maxTweets}')
    search_twitter_api(api, searchQuery, maxTweets, fName=filename)

Query=@realDonaldTrump, max=14999
Downloading max14999 for @realDonaldTrump...
Downloaded 76 tweets
Downloaded 148 tweets
Downloaded 217 tweets
Downloaded 287 tweets
Downloaded 363 tweets
Downloaded 438 tweets
Downloaded 510 tweets
Downloaded 587 tweets
Downloaded 668 tweets
Downloaded 740 tweets
Downloaded 822 tweets
Downloaded 892 tweets
Downloaded 968 tweets
Downloaded 1038 tweets
Downloaded 1104 tweets
Downloaded 1178 tweets
Downloaded 1261 tweets
Downloaded 1333 tweets
Downloaded 1407 tweets
Downloaded 1481 tweets
Downloaded 1552 tweets
Downloaded 1619 tweets
Downloaded 1692 tweets
Downloaded 1776 tweets
Downloaded 1849 tweets
Downloaded 1926 tweets
Downloaded 1999 tweets
Downloaded 2071 tweets
Downloaded 2143 tweets
Downloaded 2210 tweets
Downloaded 2286 tweets
Downloaded 2362 tweets
Downloaded 2430 tweets
Downloaded 2506 tweets
Downloaded 2589 tweets
Downloaded 2656 tweets
Downloaded 2730 tweets
Downloaded 2800 tweets
Downloaded 2875 tweets
Downloaded 2951 tweets
Downloaded 3021

In [33]:
# Extract tweets for top #
final_query_list = query_tags[:3]
filename = 'tweets_for_top3_tags.txt'

for q in final_query_list:
    searchQuery = q[0]
    maxTweets = q[1]
    print(f'Query={searchQuery}, max={maxTweets}')
    search_twitter_api(api, searchQuery, maxTweets, fName=filename)

Query=#news, max=130268
Downloading max130268 for #news...
Downloaded 58 tweets
Downloaded 136 tweets
Downloaded 222 tweets
Downloaded 293 tweets
Downloaded 368 tweets
Downloaded 453 tweets
Downloaded 525 tweets
Downloaded 598 tweets
Downloaded 665 tweets
Downloaded 743 tweets
Downloaded 834 tweets
Downloaded 893 tweets
Downloaded 934 tweets
Downloaded 1014 tweets
Downloaded 1094 tweets
Downloaded 1162 tweets
Downloaded 1231 tweets
Downloaded 1304 tweets
Downloaded 1364 tweets
Downloaded 1440 tweets
Downloaded 1509 tweets
Downloaded 1590 tweets
Downloaded 1679 tweets
Downloaded 1758 tweets
Downloaded 1831 tweets
Downloaded 1917 tweets
Downloaded 2011 tweets
Downloaded 2084 tweets
Downloaded 2168 tweets
Downloaded 2258 tweets
Downloaded 2334 tweets
Downloaded 2401 tweets
Downloaded 2477 tweets
Downloaded 2555 tweets
Downloaded 2638 tweets
Downloaded 2713 tweets
Downloaded 2792 tweets
Downloaded 2863 tweets
Downloaded 2946 tweets
Downloaded 3032 tweets
Downloaded 3120 tweets
Downloaded 3

Rate limit reached. Sleeping for: 387


Downloaded 12553 tweets
Downloaded 12649 tweets
Downloaded 12716 tweets
Downloaded 12780 tweets
Downloaded 12870 tweets
Downloaded 12945 tweets
Downloaded 13030 tweets
Downloaded 13118 tweets
Downloaded 13209 tweets
Downloaded 13306 tweets
Downloaded 13393 tweets
Downloaded 13487 tweets
Downloaded 13583 tweets
Downloaded 13676 tweets
Downloaded 13769 tweets
Downloaded 13846 tweets
Downloaded 13925 tweets
Downloaded 14008 tweets
Downloaded 14098 tweets
Downloaded 14173 tweets
Downloaded 14243 tweets
Downloaded 14329 tweets
Downloaded 14407 tweets
Downloaded 14472 tweets
Downloaded 14538 tweets
Downloaded 14605 tweets
Downloaded 14682 tweets
Downloaded 14755 tweets
Downloaded 14822 tweets
Downloaded 14910 tweets
Downloaded 14997 tweets
Downloaded 15081 tweets
Downloaded 15160 tweets
Downloaded 15239 tweets
Downloaded 15306 tweets
Downloaded 15379 tweets
Downloaded 15461 tweets
Downloaded 15524 tweets
Downloaded 15604 tweets
Downloaded 15698 tweets
Downloaded 15784 tweets
Downloaded 15880

Downloaded 39878 tweets
Downloaded 39949 tweets
Downloaded 40023 tweets
Downloaded 40099 tweets
Downloaded 40156 tweets
Downloaded 40244 tweets
Downloaded 40330 tweets
Downloaded 40413 tweets
Downloaded 40487 tweets
Downloaded 40553 tweets
Downloaded 40634 tweets
Downloaded 40716 tweets
Downloaded 40791 tweets
Downloaded 40868 tweets
Downloaded 40950 tweets
Downloaded 41037 tweets
Downloaded 41119 tweets
Downloaded 41203 tweets
Downloaded 41288 tweets
Downloaded 41370 tweets
Downloaded 41448 tweets
Downloaded 41527 tweets
Downloaded 41608 tweets
Downloaded 41685 tweets
Downloaded 41754 tweets
Downloaded 41812 tweets
Downloaded 41888 tweets
Downloaded 41969 tweets
Downloaded 42060 tweets
Downloaded 42139 tweets
Downloaded 42208 tweets
Downloaded 42301 tweets
Downloaded 42392 tweets
Downloaded 42476 tweets
Downloaded 42561 tweets
Downloaded 42637 tweets
Downloaded 42714 tweets
Downloaded 42789 tweets
Downloaded 42870 tweets
Downloaded 42944 tweets
Downloaded 43011 tweets
Downloaded 43082

Rate limit reached. Sleeping for: 203


Downloaded 48520 tweets
Downloaded 48601 tweets
Downloaded 48693 tweets
Downloaded 48769 tweets
Downloaded 48860 tweets
Downloaded 48946 tweets
Downloaded 49031 tweets
Downloaded 49094 tweets
Downloaded 49168 tweets
Downloaded 49247 tweets
Downloaded 49313 tweets
Downloaded 49375 tweets
Downloaded 49457 tweets
Downloaded 49533 tweets
Downloaded 49620 tweets
Downloaded 49695 tweets
Downloaded 49776 tweets
Downloaded 49868 tweets
Downloaded 49948 tweets
Downloaded 50024 tweets
Downloaded 50090 tweets
Downloaded 50147 tweets
Downloaded 50224 tweets
Downloaded 50283 tweets
Downloaded 50366 tweets
Downloaded 50447 tweets
Downloaded 50528 tweets
Downloaded 50620 tweets
Downloaded 50706 tweets
Downloaded 50792 tweets
Downloaded 50863 tweets
Downloaded 50952 tweets
Downloaded 51035 tweets
Downloaded 51104 tweets
Downloaded 51178 tweets
Downloaded 51258 tweets
Downloaded 51324 tweets
Downloaded 51396 tweets
Downloaded 51471 tweets
Downloaded 51552 tweets
Downloaded 51637 tweets
Downloaded 51714

Downloaded 74889 tweets
Downloaded 74971 tweets
Downloaded 75045 tweets
Downloaded 75129 tweets
Downloaded 75201 tweets
Downloaded 75276 tweets
Downloaded 75351 tweets
Downloaded 75426 tweets
Downloaded 75516 tweets
Downloaded 75600 tweets
Downloaded 75677 tweets
Downloaded 75747 tweets
Downloaded 75831 tweets
Downloaded 75917 tweets
Downloaded 75990 tweets
Downloaded 76069 tweets
Downloaded 76143 tweets
Downloaded 76232 tweets
Downloaded 76306 tweets
Downloaded 76387 tweets
Downloaded 76466 tweets
Downloaded 76555 tweets
Downloaded 76632 tweets
Downloaded 76720 tweets
Downloaded 76798 tweets
Downloaded 76882 tweets
Downloaded 76972 tweets
Downloaded 77064 tweets
Downloaded 77148 tweets
Downloaded 77238 tweets
Downloaded 77327 tweets
Downloaded 77427 tweets
Downloaded 77509 tweets
Downloaded 77596 tweets
Downloaded 77682 tweets
Downloaded 77760 tweets
Downloaded 77837 tweets
Downloaded 77911 tweets
Downloaded 77995 tweets
Downloaded 78056 tweets
Downloaded 78133 tweets
Downloaded 78220

Rate limit reached. Sleeping for: 157


Downloaded 83282 tweets
Downloaded 83363 tweets
Downloaded 83440 tweets
Downloaded 83520 tweets
Downloaded 83592 tweets
Downloaded 83687 tweets
Downloaded 83771 tweets
Downloaded 83855 tweets
Downloaded 83928 tweets
Downloaded 84008 tweets
Downloaded 84097 tweets
Downloaded 84177 tweets
Downloaded 84265 tweets
Downloaded 84344 tweets
Downloaded 84432 tweets
Downloaded 84526 tweets
Downloaded 84604 tweets
Downloaded 84685 tweets
Downloaded 84771 tweets
Downloaded 84860 tweets
Downloaded 84921 tweets
Downloaded 85003 tweets
Downloaded 85089 tweets
Downloaded 85173 tweets
Downloaded 85264 tweets
Downloaded 85351 tweets
Downloaded 85431 tweets
Downloaded 85516 tweets
Downloaded 85601 tweets
Downloaded 85675 tweets
Downloaded 85753 tweets
Downloaded 85843 tweets
Downloaded 85931 tweets
Downloaded 86012 tweets
Downloaded 86098 tweets
Downloaded 86172 tweets
Downloaded 86253 tweets
Downloaded 86339 tweets
Downloaded 86427 tweets
Downloaded 86510 tweets
Downloaded 86595 tweets
Downloaded 86686

NameError: name 'tweepy' is not defined

___

In [36]:
df.head()

date_published,author,content,region,following,followers,updates,post_type,account_type,retweet,account_category
2017-10-01 19:58:00,10_GOP,"""We have a sitting Democrat US Senator on tria...",Unknown,1052,9636,253,,Right,0,RightTroll
2017-10-01 22:43:00,10_GOP,Marshawn Lynch arrives to game in anti-Trump s...,Unknown,1054,9637,254,,Right,0,RightTroll
2017-10-01 22:50:00,10_GOP,Daughter of fallen Navy Sailor delivers powerf...,Unknown,1054,9637,255,RETWEET,Right,1,RightTroll
2017-10-01 23:52:00,10_GOP,JUST IN: President Trump dedicates Presidents ...,Unknown,1062,9642,256,,Right,0,RightTroll
2017-10-01 02:13:00,10_GOP,"19,000 RESPECTING our National Anthem! #StandF...",Unknown,1050,9645,246,RETWEET,Right,1,RightTroll


### Notes on Making New Extracted Tweets Match Russian Troll Tweet Database

- Columns to be renamed/reformatted to match troll tweets:
    - created_at -> 'date_published' -> index
    - full_text -> 'content'
    - df['user']:
        - ['friends_count'] -> 'following'
        - ['followers_count'] -> 'followers'
        - ['screen_name'] -> 
        - ['id'] ->
- Columns missing from the original troll tweets (to be removed):
    - coordinates, favorited, favorite_count, display_text_range, withheld_in_countries
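
The renaming plan above could be sketched as a small function that flattens one raw tweet dict into the troll-dataset column names. The function name `flatten_tweet` and the sample record are hypothetical, and mapping `screen_name` to `author` is an assumption that the notes leave open. The sketch reads `friends_count` for `following`, since that is the field in the Twitter user object that counts the accounts a user follows.

```python
def flatten_tweet(tweet):
    """Map one raw tweet dict (as saved by the search function) onto troll-dataset column names."""
    user = tweet.get("user", {})
    return {
        "date_published": tweet.get("created_at"),
        "content": tweet.get("full_text"),
        "following": user.get("friends_count"),    # accounts this user follows
        "followers": user.get("followers_count"),  # accounts following this user
        "author": user.get("screen_name"),         # assumption: the handle maps to 'author'
    }


# Made-up record with the same shape as the harvested JSON
sample_tweet = {
    "created_at": "Sun Jun 02 18:34:59 +0000 2019",
    "full_text": "Example tweet text",
    "favorite_count": 0,
    "user": {"screen_name": "example_user", "friends_count": 12, "followers_count": 34},
}
row = flatten_tweet(sample_tweet)
print(row)
```

Fields that have no counterpart in the troll data, such as `favorite_count`, are simply not copied, which covers the "to be removed" columns without a separate drop step.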
    

In [34]:
df_tweets_ats = pd.read_json('tweets_for_top3_ats.txt', lines=True)

In [37]:
df_tweets_ats.user[0]

{'contributors_enabled': False,
 'created_at': 'Sat Apr 06 22:35:56 +0000 2019',
 'default_profile': True,
 'default_profile_image': True,
 'description': 'Trump Hater',
 'entities': {'description': {'urls': []}},
 'favourites_count': 0,
 'follow_request_sent': None,
 'followers_count': 0,
 'following': None,
 'friends_count': 0,
 'geo_enabled': False,
 'has_extended_profile': False,
 'id': 1114658021785919488,
 'id_str': '1114658021785919488',
 'is_translation_enabled': False,
 'is_translator': False,
 'lang': 'en',
 'listed_count': 0,
 'location': '',
 'name': 'Jimmy',
 'notifications': None,
 'profile_background_color': 'F5F8FA',
 'profile_background_image_url': None,
 'profile_background_image_url_https': None,
 'profile_background_tile': False,
 'profile_image_url': 'http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png',
 'profile_image_url_https': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png',
 'profile_link_color': '1DA

In [35]:
df_tweets_ats.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user,withheld_in_countries
0,,,2019-06-02 18:34:59,"[0, 202]","{'hashtags': [{'indices': [48, 66], 'text': 'd...",,0,False,@realDonaldTrump it’s perfectly reasonable tha...,,...,,,,0,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'contributors_enabled': False, 'created_at': ...",
1,,,2019-06-02 18:34:59,"[0, 139]","{'hashtags': [], 'symbols': [], 'urls': [], 'u...",,0,False,RT @BelkissObadia: BREAKING NEWS: \n\n@realDon...,,...,,,,950,False,"{'contributors': None, 'coordinates': None, 'c...","<a href=""http://twitter.com/download/android"" ...",False,"{'contributors_enabled': False, 'created_at': ...",
2,,,2019-06-02 18:34:59,"[17, 207]","{'hashtags': [], 'symbols': [], 'urls': [], 'u...",,0,False,@realDonaldTrump I thought you were supposed t...,,...,,,,0,False,,"<a href=""http://twitter.com/download/android"" ...",False,"{'contributors_enabled': False, 'created_at': ...",
3,,,2019-06-02 18:34:59,"[0, 139]","{'hashtags': [], 'symbols': [], 'urls': [], 'u...",,0,False,RT @realDonaldTrump: Mexico is sending a big d...,,...,,,,3128,False,"{'contributors': None, 'coordinates': None, 'c...","<a href=""http://twitter.com/#!/download/ipad"" ...",False,"{'contributors_enabled': False, 'created_at': ...",
4,,,2019-06-02 18:34:59,"[0, 139]","{'hashtags': [], 'symbols': [], 'urls': [], 'u...",,0,False,RT @realDonaldTrump: Mexico is sending a big d...,,...,,,,3128,False,"{'contributors': None, 'coordinates': None, 'c...","<a href=""http://twitter.com/download/android"" ...",False,"{'contributors_enabled': False, 'created_at': ...",


In [None]:
df_tweets_ats['date_published'] = pd.to_datetime(df_tweets_ats['created_at'])

In [None]:
df_tweets_ats['date_published'].min(), df_tweets_ats['date_published'].max()
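
For reference, raw `created_at` strings from the API (such as the one in the user record above) use a fixed format that the standard library can also parse. A minimal sketch, independent of pandas; the second timestamp is an assumption, written in the same format to match the first harvested tweet, and the `%a`/`%b` codes assume an English locale:

```python
from datetime import datetime

# Format of Twitter's created_at field, e.g. 'Sat Apr 06 22:35:56 +0000 2019'
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_twitter_time(raw):
    """Parse a raw created_at string into a timezone-aware datetime."""
    return datetime.strptime(raw, TWITTER_TIME_FORMAT)


stamps = [
    parse_twitter_time("Sat Apr 06 22:35:56 +0000 2019"),
    parse_twitter_time("Sun Jun 02 18:34:59 +0000 2019"),
]
print(min(stamps), max(stamps))
```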