# 3 - Million Russian Troll Tweets - Part I
- James M Irving, Ph.D.
- Mod 4 Project
- Flatiron Full Time Data Science Bootcamp - 02/2019 Cohort

## GOAL: Using Twitter API to Extract Control Tweets

- *If I can get a control dataset of non-Troll tweets from the same time period with similar hashtags:*
    - Use NLP to predict if a tweet is from an authentic user or a Russian troll.
- *If no control tweets to compare to*
    - Use NLP to predict how many retweets a Troll tweet will get.
    - Consider both raw # of retweets, as well as a normalized # of retweets/# of followers.
        - The latter would give better indication of language's effect on propagation. 
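
The normalized measure mentioned above can be sketched as follows. This is a minimal illustration that assumes a per-tweet retweet count can be obtained; the function name and the numbers are hypothetical and not taken from the dataset.

```python
def retweets_per_follower(retweets, followers):
    """Return retweets normalized by follower count (0.0 when there are no followers)."""
    if followers <= 0:
        return 0.0
    return retweets / followers

# Hypothetical numbers: the same raw retweet count implies very different reach
small_account = retweets_per_follower(retweets=50, followers=200)
large_account = retweets_per_follower(retweets=50, followers=20000)
print(small_account, large_account)
```

Guarding against zero followers avoids a division error for brand-new accounts.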
        

# INSPECTING TROLL TWEETS FROM KAGGLE

In [1]:
!pip install -U bs_ds

Requirement already up-to-date: bs_ds in /anaconda3/envs/learn-env/lib/python3.6/site-packages (0.10.0)


Collecting bleach==1.5.0
  Using cached https://files.pythonhosted.org/packages/33/70/86c5fec937ea4964184d4d6c4f0b9551564f821e1c3575907639036d9b90/bleach-1.5.0-py2.py3-none-any.whl


ERROR: readme-renderer 24.0 has requirement bleach>=2.1.0, but you'll have bleach 1.5.0 which is incompatible.
Installing collected packages: bleach
  Found existing installation: bleach 3.1.0
    Uninstalling bleach-3.1.0:
      Successfully uninstalled bleach-3.1.0
Successfully installed bleach-3.1.0


In [2]:
import bs_ds as bs
from bs_ds.imports import *

bs_ds  v0.10.0 loaded.  Read the docs: https://bs-ds.readthedocs.io/en/latest/index.html
> For convenient loading of standard modules use: `from bs_ds.imports import *`



Package,Handle,Description
bs_ds,bs,Custom data science bootcamp student package
matplotlib,mpl,Matplotlib's base OOP module with formatting artists
matplotlib.pyplot,plt,Matplotlib's matlab-like plotting module
numpy,np,scientific computing with Python
pandas,pd,High performance data structures and tools
seaborn,sns,High-level data visualization library based on matplotlib


In [3]:
import os
root_dir = 'russian-troll-tweets/'
# os.listdir('russian-troll-tweets/')
filelist = [os.path.join(root_dir,file) for file in os.listdir(root_dir) if file.endswith('.csv')]
filelist

[]

In [None]:
# Vertically concatenate 
df = pd.DataFrame()
for file in filelist:
    df_new = pd.read_csv(file)
    df = pd.concat([df,df_new], axis=0)
df.info()

In [None]:
df.head(2)

## Dataset Features:
- Kaggle Dataset published by FiveThirtyEight
    - https://www.kaggle.com/fivethirtyeight/russian-troll-tweets/downloads/russian-troll-tweets.zip/2
<br>    
- Data is split into 9 .csv files
    - 'IRAhandle_tweets_1.csv' to 9

- **Variables:**
    - ~~`external_author_id` | An author account ID from Twitter~~
    - `author` | The handle sending the tweet
    - `content` | The text of the tweet
    - `region` | A region classification, as [determined by Social Studio](https://help.salesforce.com/articleView?id=000199367&type=1)
    - `language` | The language of the tweet
    - `publish_date` | The date and time the tweet was sent
    - ~~`harvested_date` | The date and time the tweet was collected by Social Studio~~
    - `following` | The number of accounts the handle was following at the time of the tweet
    - `followers` | The number of followers the handle had at the time of the tweet
    - `updates` | The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
    - `post_type` | Indicates if the tweet was a retweet or a quote-tweet *[Whats a quote-tweet?]*
    - `account_type` | Specific account theme, as coded by Linvill and Warren
    - `retweet` | A binary indicator of whether or not the tweet is a retweet [?]
    - `account_category` | General account theme, as coded by Linvill and Warren
    - `new_june_2018` | A binary indicator of whether the handle was newly listed in June 2018
    
### **Classification of account_type**
Taken from: [rcmediafreedom.eu summary](https://www.rcmediafreedom.eu/Publications/Academic-sources/Troll-Factories-The-Internet-Research-Agency-and-State-Sponsored-Agenda-Building)

>- **They identified five categories of IRA-associated Twitter accounts, each with unique patterns of behaviors:**
    - **Right Troll**, spreading nativist and right-leaning populist messages. It supported the candidacy and Presidency of Donald Trump and denigrated the Democratic Party. It often sent divisive messages about mainstream and moderate Republicans.
    - **Left Troll**, sending socially liberal messages and discussing gender, sexual, religious, and -especially- racial identity. Many tweets seemed intentionally divisive, attacking mainstream Democratic politicians, particularly Hillary Clinton, while supporting Bernie Sanders prior to the election.
    - **News Feed**, overwhelmingly presenting themselves as U.S. local news aggregators, linking to legitimate regional news sources and tweeting about issues of local interest.
    - **Hashtag Gamer**, dedicated almost exclusively to playing hashtag games.
    - **Fearmonger**: spreading a hoax about poisoned turkeys near the 2015 Thanksgiving holiday.

>The different types of account were used differently and their efforts were conducted systematically, with different allocation when faced with different political circumstances or shifting goals. E.g.: there was a spike of activity by right and left troll accounts before the publication of John Podesta's emails by WikiLeaks. According to the authors, this activity can be characterised as “industrialized political warfare”.

___

# SCRUB / EDA

In [None]:
# from pandas_profiling import ProfileReport
# ProfileReport(df)

## Observations from Inspection / Pandas_Profiling ProfileReport

- **Language to Analyze is in `Content`:**
    - Actual tweet contents. 
 
- **Classification/Analysis Thoughts:**
    - **Variables should be considered in 2 ways:**
        - First, the tweet contents. 
            - Use NLP to engineer features to feed into deep learning.
                - Sentiment analysis, named-entity frequency/types, most-similar words. 
        - Second, the tweet metadata. 
        
### Thoughts on specific features:
- `language`
    - There are 56 unique languages. 
    - 2.4 million are English, 670 K are in Russian, etc.

### Questions to answer:
- [x] Why are so many post_types missing? (55%?)
    - Because those accounts were newly added in June 2018 (`new_june_2018`) and were not classified by the original scientists. 
- [x] How many tweets were written by a Russian troll account?
    - After removing retweets, there are 1,272,848 original tweets. 
    
### Scrubbing to Perform
- **Recast Columns:**
    - [ ] `publish_date` to datetime. 
- **Columns to Discard:**
    - [ ] `harvested_date` (we care about publish_date, if anything, time-wise)
    - [ ] `language`: remove all non-English tweets and drop column
    - [ ] `new_june_2018`

### Reducing Targeted Tweets to Language=English and Retweet=0 Only

- Since the goal is to use NLP to detect which tweets came from Russian trolls, we will only analyze the tweets that were originally created by a known Russian troll account

In [None]:
# Keep only English tweets that are not retweets
# .copy() avoids SettingWithCopyWarning when the result is modified later
df = df.loc[(df.language=='English') & (df.retweet==0)].copy()
df.info()

In [None]:
# Drop harvested_date and new_june_2018
cols_to_drop = ['harvested_date','new_june_2018']

df = df.drop(columns=cols_to_drop)

df.info()

___
## Save/Load and Resume

In [None]:
# save_or_load = input('Would you like to "save" or "load" dataframe?\n("save","load","no"):')

# if save_or_load.lower()=='save':
#     # Save csv
#     df.to_csv('russian_troll_tweets_eng_only_date_pub_index.csv')
    
# if save_or_load.lower()=='load':
#     import bs_ds as bs
#     from bs_ds.imports import *
#     # Load csv
#     df = pd.read_csv('russian_troll_tweets_eng_only_date_pub_index.csv')    

In [None]:
for i in range(10):
    print(i,'\t',np.random.choice(df['content']))

### Recasting Publish date as datetime column (date_published)

In [None]:
# Recast publish_date as datetime (date_published) and make it the index
df['date_published'] = pd.to_datetime(df['publish_date'])
df.set_index('date_published', inplace=True)
print('Changed index to datetime "date_published".')

In [None]:
# Convert publish_date to datetime
# df['date_published'] = pd.to_datetime(df.publish_date)
print(f'Tweet dates from {np.min(df.index)}  to  {np.max(df.index)}')

# Using TwitterAPI to Harvest Control Tweets

## My Search Strategy

- **We need non-troll Tweets to use as a control for the Troll tweets. Ideally, these would be from the same time period covered by the Troll tweets.**

    - However, extracting batch historical tweets from the same time period is not an option. This would **require Twitter Enterprise level** Developer Membership (which costs **\\$2,000 per month**)
    - The free Twitter developer account access allows extracting Tweets from the last 7 days. We will have to work within this limitation for harvesting control tweets.
    - Due to the temporal difference, extra care must be put into Tweet search strategy.



**Inspect Data to get search parameters:**
- [X] Get the date range for the English tweets in the original dataset<br>
    - **Tweet date range:**
        - **2012-02-06** to **2018-05-30**

- [X] Get a list of the hashtags (and their frequencies) from the dataframe


In [None]:
# Inspect Data to get search parameters:
print(f'Tweet date range:\n {min(df.index)} to {max(df.index)}')
print(f'\nTotal days: {max(df.index)-min(df.index)}')

## Determining Hashtags & @'s to search for

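- Use regular expressions to extract the hashtags (#words) and @handles.
- Use the top X tags as search terms for the Twitter API.
    - There are _1,678,170 hashtags_ and _1,165,744 @'s_ in total. These figures come from `len()` of the extracted series, so they count every occurrence rather than unique values.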
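
The extraction step described above can be sketched with the standard library alone. This is an illustrative variation, not the notebook's own cell: the sample tweets and `@example_user` are made up, and the patterns use `\w+` instead of `\w*` so that a bare `#` or `@` is not counted.

```python
import re
from collections import Counter

# \w+ (rather than \w*) skips a bare '#' or '@' with no text after it
HASHTAG_RE = re.compile(r'#\w+')
MENTION_RE = re.compile(r'@\w+')

# Made-up tweets, for illustration only
sample_tweets = [
    'Breaking: storm hits the coast #news #local',
    'RT @example_user: big game tonight #sports',
    'Morning headlines #news via @example_user',
]

# Count every occurrence across all tweets
tag_counts = Counter(tag for text in sample_tweets for tag in HASHTAG_RE.findall(text))
at_counts = Counter(at for text in sample_tweets for at in MENTION_RE.findall(text))

print(tag_counts.most_common(2))
print(at_counts.most_common(1))
```

`Counter.most_common(n)` plays the same role here as `value_counts()[:n]` on a pandas Series.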

In [None]:
# NEW: Make a column containing all hashtags and mentions
import re
hashtags = re.compile(r'(#\w*)')
# df['hashtags'] = df['content'].map(lambda x: hashtags.findall(str(x)))

mentions = re.compile(r'(@\w*)')
# df['mentions'] = df['content'].map(lambda x: mentions.findall(str(x)))

urls = re.compile(r"(https?://\w*\.\w*/+\w+)")
# df['links'] = df['content'].map(lambda x: urls.findall(str(x)))

# Testing individual re's from above: one list of hashtags per tweet
hashtag_list = df['content'].map(lambda x: hashtags.findall(str(x)))

In [None]:
# Flatten the per-tweet lists into a single list of hashtags
all_hashtags = [tag for tags in hashtag_list for tag in tags]

hashtag_counts = pd.Series(all_hashtags)
hashtag_counts.value_counts()

### def get_tags_ats

In [None]:
# Define get_tags_ats to accept a list of text entries and return all found tags and ats as 2 series/lists
def get_tags_ats(text_to_search,exp_tag = r'(#\w*)',exp_at = r'(@\w*)', output='series',show_counts=False):
    """Accepts a list of text entries to search, a regex for tags, and a regex for @'s.
    Joins all entries in the list of text and then runs re.findall() for both expressions.
    Returns found_tags and found_ats as series (or as lists if output is not 'series')."""
    import re
    
    if (output.lower() != 'series') and show_counts:
        raise Exception('output must be set to "series" in order to show_counts')
    
    # Create a single long string from all entries (str() guards against NaN contents)
    text_to_search_combined = ' '.join(map(str, text_to_search))
        
    found_tags = re.findall(exp_tag, text_to_search_combined)
    found_ats = re.findall(exp_at, text_to_search_combined)
    
    if output.lower() == 'series':
        found_tags = pd.Series(found_tags, name='tags')
        found_ats = pd.Series(found_ats, name='ats')
        
        if show_counts:
            print(f'\t{found_tags.name}:\n{found_tags.value_counts()} \n\n\t{found_ats.name}:\n{found_ats.value_counts()}')
                       
    return found_tags, found_ats

In [None]:
# Need a list of tweet texts to search for hashtags
text_to_search_list = df['content'].astype(str).tolist()

text_to_search_list[:2]

In [None]:
# Get all tweet tags and @'s from text_to_search_list
tweet_tags, tweet_ats = get_tags_ats(text_to_search_list, show_counts=False)

# len() counts every occurrence; nunique() counts distinct values
print(f"There were {len(tweet_tags)} hashtags ({tweet_tags.nunique()} unique) and {len(tweet_ats)} @'s ({tweet_ats.nunique()} unique)\n")

# Create a dataframe with top_tags (rename keeps the count column named 'tags' across pandas versions)
df_top_tags = tweet_tags.value_counts()[:40].rename('tags').to_frame()
df_top_tags['% Total'] = (df_top_tags['tags']/len(tweet_tags)*100)

# Create a dataframe with top_ats
df_top_ats = tweet_ats.value_counts()[:40].rename('ats').to_frame()
df_top_ats['% Total'] = (df_top_ats['ats']/len(tweet_ats)*100)

# Display top tags and ats
# bs.display_side_by_side(df_top_tags,df_top_ats)

### Notes on Top Tags and Ats:


In [None]:
# Choose list of top tags to use in search
list_top_30_tags = df_top_tags.index[:30]
list_top_30_tags

In [None]:
# Choose list of top @'s to use in search
list_top_30_ats = df_top_ats.index[:30]
list_top_30_ats

## Summary of Tweet Search Strategy 
- **The most common hashtags include some very generic categories** that will not be appropriate control tweets.
    - Examples:
        - '#news','#sports','#politics','#world','#local','#TopNews','#health','#business','#tech'
    - If we used these to extract control Tweets, our model would be biased, since many of these categories contain time-specific topics and would therefore be easy to predict vs. the troll tweets.
  
- **The most common @'s are much more revealing and helpful in narrowing the focus of the results.**
    - The final decision is to take the top 40 mentions from the troll tweets and extract present-day Tweets containing the same mentions. 
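
One way to keep the control set from over-representing a single mention is to split a fixed harvest budget in proportion to how often each mention appears in the troll tweets. The sketch below is an assumption for illustration only: `allocate_quotas`, the handles, and the counts are hypothetical, and the harvesting loop later in this notebook passes the raw counts as the per-query maximum instead.

```python
def allocate_quotas(mention_counts, total_budget):
    """Split total_budget across mentions in proportion to their observed counts."""
    total = sum(mention_counts.values())
    return {mention: round(total_budget * count / total)
            for mention, count in mention_counts.items()}

# Hypothetical counts, not taken from the dataset
mention_counts = {'@example_a': 600, '@example_b': 300, '@example_c': 100}
quotas = allocate_quotas(mention_counts, total_budget=1000)
print(quotas)
```

Because each share is rounded independently, the quotas may not sum exactly to the budget for less convenient numbers.
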
___

# Using the Twitter Search API to Extract Control Tweets

- [x] Required API keys are saved in the main folder that contains this repo (outside the repo itself). 
- [x] Check the [Premium account docs for search syntax](https://developer.twitter.com/en/docs/tweets/search/guides/premium-operators.html)
- [x] [Check this article for using Tweepy for most efficient twitter api extraction](https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)

**LINK TO PREMIUM SEARCH API GUIDE**<br>
https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search

**Available search operators**
- Premium search API supports rules with up to 1,024 characters. The Search Tweets APIs support the premium operators listed below. See our Premium operators guide for more details.

- The base URI for the premium search API is https://api.twitter.com/1.1/tweets/search/.

**Matching on Tweet contents:**
- keyword , "quoted phrase" , # , @, url , lang
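
As a rough sketch of how these operators combine, the helper below joins a hashtag or @mention with a `lang:` operator and checks the 1,024-character rule limit quoted above. `build_query` and `@example_user` are hypothetical, and which operators are accepted depends on the API tier being used.

```python
MAX_RULE_LENGTH = 1024  # premium search rule limit quoted above

def build_query(term, lang='en'):
    """Combine a hashtag or @mention with a lang operator into one search rule."""
    query = f'{term} lang:{lang}'
    if len(query) > MAX_RULE_LENGTH:
        raise ValueError(f'Query is {len(query)} characters; limit is {MAX_RULE_LENGTH}')
    return query

example_query = build_query('@example_user')
print(example_query)
```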


## Using `tweepy` to access twitter API

### def connect_twitter_api, def search_twitter_api

In [None]:
# Initialize Tweepy with Authorization Keys    
def connect_twitter_api(api_key, api_secret_key):
    import tweepy, sys
    auth = tweepy.AppAuthHandler(api_key, api_secret_key)
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    if (not api):
        print("Can't authenticate.")
        sys.exit(-1)
    return api

In [None]:
def search_twitter_api(api_object, searchQuery, maxTweets, fName, tweetsPerQry=100, max_id=0, sinceId=None):
    """Take an authenticated tweepy api_object, a search query, the max # of tweets to retrieve, and a destination filename.
    Calls tweepy's api.search for the searchQuery until maxTweets is reached, appending harvested tweets to fName."""
    import jsonpickle, tweepy
    api = api_object
    tweetCount = 0
    print(f'Downloading max {maxTweets} for {searchQuery}...')
    with open(fName, 'a+') as f:
        while tweetCount < maxTweets:
            # Only send max_id / since_id once they have been set
            search_kwargs = dict(q=searchQuery, count=tweetsPerQry, tweet_mode='extended')
            if max_id > 0:
                search_kwargs['max_id'] = str(max_id-1)
            if sinceId:
                search_kwargs['since_id'] = sinceId

            try:
                new_tweets = api.search(**search_kwargs)

                if not new_tweets:
                    print('No more tweets found')
                    break

                # Write one JSON object per line
                for tweet in new_tweets:
                    f.write(jsonpickle.encode(tweet._json, unpicklable=False)+'\n')

                tweetCount+=len(new_tweets)

                print("Downloaded {0} tweets".format(tweetCount))
                # Page backwards: the next request asks for tweets older than this one
                max_id = new_tweets[-1].id

            except tweepy.TweepError as e:
                # Just exit if any error
                print("some error : " + str(e))
                break
    print ("Downloaded {0} tweets, Saved to {1}\n".format(tweetCount, fName))
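
The paging logic in `search_twitter_api` can be hard to follow with a live API, so here is a minimal offline sketch of the same `max_id` cursor idea. `fake_search` and `harvest` are hypothetical stand-ins that operate on plain integer ids; nothing here calls Twitter or tweepy.

```python
def fake_search(all_ids, count, max_id=None):
    """Stand-in for a search endpoint: returns up to `count` ids <= max_id, newest first."""
    eligible = [i for i in sorted(all_ids, reverse=True) if max_id is None or i <= max_id]
    return eligible[:count]

def harvest(all_ids, max_tweets, per_page):
    """Collect ids page by page, moving the max_id cursor below the oldest id seen."""
    collected = []
    max_id = None
    while len(collected) < max_tweets:
        page = fake_search(all_ids, per_page, max_id)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        # Next request starts just below the oldest id seen so far
        max_id = page[-1] - 1
    return collected

ids = list(range(1, 11))  # ten pretend tweet ids
result = harvest(ids, max_tweets=7, per_page=3)
print(result)
```

As in the function above, the limit is only checked between pages, so the final count can exceed `max_tweets` by up to one page.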

## Connect to Twitter and Harvest Tweets

### Making lists of tags and ats to query

In [None]:
# Figure out the # of each @ and each # that I want to query, then build query tuples to feed into the search cell below
query_ats = tuple(zip(df_top_ats.index, df_top_ats['ats']))
query_tags = tuple(zip(df_top_tags.index, df_top_tags['tags']))

# Calculate how many tweets are represented by the top 40 tags and top 40 @'s 
sum_top_tweet_tags = df_top_tags['tags'].sum()
sum_top_tweet_ats = df_top_ats['ats'].sum()
print(f"Sum of top tags = {sum_top_tweet_tags}\nSum of top @'s = {sum_top_tweet_ats}")

print(query_ats[:10],'\n')
print(query_tags[:10])

In [None]:
np.sum([x[1] for x in query_ats])

In [None]:
# Inspect Data to get search parameters:
print(f'Tweet date range:\n {min(df.index)} to {max(df.index)}')
print(f'\nTotal days: {max(df.index)-min(df.index)}')

### Connecting to twitter api and searching for lists of queries

In [None]:
# Import API keys from text files (so not displayed here and not in repo)
with open('../consumer_API_key.txt','r') as f:
    api_key =  f.read().strip()  # strip() drops any trailing newline
with open('../consumer_API_secret_key.txt','r') as f:
    api_secret_key  = f.read().strip()

#### Test searches

In [None]:
# Manually connecting to API and doing test searches. 
import tweepy, sys
auth = tweepy.AppAuthHandler(api_key, api_secret_key)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

if (not api):
    print("Can't authenticate.")
    sys.exit(-1)

In [None]:
# Search for a batch of test results
searchQuery='#politics'
tweetsPerQry=100

new_tweets = api.search(q=searchQuery, count=tweetsPerQry, tweet_mode='extended')
type(new_tweets)

In [None]:
# Display time range of new_tweets so I can define a time range to test
test_dates = [x.created_at for x in new_tweets]
print(f'Range:{min(test_dates)} to {max(test_dates)}')
test_dates[0], test_dates[-1]

In [None]:
from datetime import datetime
end_time = datetime(2019,6,2,20,0,0)
end_time

In [None]:
## DEFINING A NEW FUNCTION TO EXAMINE THE NEW_TWEETS OUTPUTS
def check_tweet_daterange(new_tweets,timerange_begin,timerange_end,verbose=0):
    """Returns the indices of the tweets in a tweepy SearchResults object whose
    created_at falls strictly between timerange_begin and timerange_end."""
    
    time_start = timerange_begin
    time_end = timerange_end
    
    # Check each tweet's creation time against the window
    idx_keep_tweets = []
    for i,tweet in enumerate(new_tweets):
        if (tweet.created_at > time_start) and (tweet.created_at < time_end):
            idx_keep_tweets.append(i)
            if verbose>0:
                print(f'tweet({i}) kept: {tweet.created_at}')
    return idx_keep_tweets
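
The filter in `check_tweet_daterange` can be tried without an API connection. The sketch below uses `SimpleNamespace` objects as hypothetical stand-ins for tweepy statuses (only `created_at` is needed) and applies the same strict comparison.

```python
from datetime import datetime
from types import SimpleNamespace

# Stand-ins for tweepy Status objects; only created_at is needed here
fake_tweets = [
    SimpleNamespace(created_at=datetime(2019, 6, 2, 18, 0, 0)),
    SimpleNamespace(created_at=datetime(2019, 6, 2, 19, 30, 0)),
    SimpleNamespace(created_at=datetime(2019, 6, 2, 21, 0, 0)),
]

window_start = datetime(2019, 6, 2, 19, 0, 0)
window_end = datetime(2019, 6, 2, 20, 0, 0)

# Same strict comparison as check_tweet_daterange: both endpoints are excluded
kept = [i for i, tweet in enumerate(fake_tweets)
        if window_start < tweet.created_at < window_end]
print(kept)
```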

In [None]:
# Determining search criteria to limit twitter results to
latest_date = max(df.index) # Get latest date from troll tweets
earliest_date = min(df.index) # Get the earliest date from troll tweets

# Convert pandas timestamps to datetime object for tweet results
latest_datetime = latest_date.to_pydatetime()
earliest_datetime = earliest_date.to_pydatetime()

#### Automated Searches:

In [None]:
api = connect_twitter_api(api_key,api_secret_key)

In [None]:
# Extract tweets for top @'s, while matching the distribution of top @'s
final_query_list = query_ats
filename = 'tweets_for_top40_ats.txt'

for q in final_query_list:
    searchQuery = q[0]
    maxTweets = q[1]
    print(f'Query={searchQuery}, max={maxTweets}')
    search_twitter_api(api, searchQuery, maxTweets, fName=filename)

## Processing Extracted Tweets from API to match Troll Tweet Features

In [None]:
df_ats = pd.read_json('tweets_for_top40_ats.txt', lines=True)

df_ats.head()
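
`pd.read_json(..., lines=True)` expects one JSON object per line, which is the layout `search_twitter_api` writes. A standard-library sketch of reading that layout is shown below; the two records are made up and hold only a few of the fields a real tweet has.

```python
import io
import json

# Two made-up records in the one-object-per-line layout
raw = io.StringIO(
    '{"id": 1, "full_text": "first tweet", "lang": "en"}\n'
    '{"id": 2, "full_text": "second tweet", "lang": "en"}\n'
)

# Parse each non-empty line as its own JSON document
records = [json.loads(line) for line in raw if line.strip()]
print(len(records), records[0]['full_text'])
```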

In [None]:
df_ats['entities'][1]

In [None]:
df = df_ats.loc[df_ats['full_text'].str.contains('RT')]

df.head()

In [None]:
import re
re_RT=re.compile(r'RT @\w*:')

In [None]:
example_rt_tweets = df['full_text'][:10]
example_rt_tweets

In [None]:
example_rt_tweets.apply(lambda x: re_RT.sub('',x))
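
A slightly stricter variation on the retweet-prefix pattern is sketched below. It is an assumption, not the notebook's pattern: it anchors the match to the start of the text and also removes the whitespace after the colon, and the example strings are made up.

```python
import re

# Matches the "RT @handle:" prefix plus any whitespace that follows it
re_rt_prefix = re.compile(r'^RT @\w+:\s*')

examples = [
    'RT @example_user: Polls open at 7am tomorrow',
    'Polls open at 7am tomorrow',
]
cleaned = [re_rt_prefix.sub('', text) for text in examples]
print(cleaned)
```

Anchoring with `^` leaves an "RT @handle:" that appears mid-tweet untouched, which may or may not be desirable depending on how quoted retweets should be treated.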

In [None]:
# Collect the first 10 (index, quoted_status) pairs under a header row
quote_list = [['idx','quoted_status']]
for k,v in list(df['quoted_status'].items())[:10]:
    quote_list.append([k,v])

In [None]:
df['retweeted_status'][1]

In [None]:
df_ats.info()

In [None]:
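# Stop "Run All" here; the cells below are run manually
raise RuntimeError('Paused: run the remaining cells manually.')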

### Notes on Making New Extracted Tweets Match Russian Troll Tweet Database

- Columns to be renamed/reformatted to match troll tweets:
    - `created_at` -> `date_published` -> index
    - `full_text` -> `content`
    - `df['user']`
        - `['friends_count']` -> `'following'`
        - `['followers_count']` -> `'followers'`

- Columns missing from original troll tweets (to be removed):
    - coordinates, favorited, favorite_count, display_text_range, withheld_in_countries
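
The renaming notes above can be sketched as a single mapping function over one tweet dict. `tweet_to_troll_row` and the sample tweet are hypothetical; the field names follow the Twitter v1.1 tweet and user objects, where `friends_count` is the number of accounts a user follows. Using the presence of `retweeted_status` as the retweet flag is an assumption that differs from the `retweeted` field used elsewhere in this notebook.

```python
def tweet_to_troll_row(tweet):
    """Map one Twitter API tweet dict onto the troll-dataset column names."""
    user = tweet['user']
    return {
        'external_author_id': user['id'],
        'author': user['screen_name'],
        'content': tweet['full_text'],
        'region': user.get('location'),
        'following': user['friends_count'],    # accounts this user follows
        'followers': user['followers_count'],  # accounts following this user
        'publish_date': tweet['created_at'],
        'language': tweet['lang'],
        'retweet': int('retweeted_status' in tweet),
        'account_category': 'control',
    }

# Made-up tweet with only the fields used above
sample_tweet = {
    'created_at': 'Sun Jun 02 19:30:00 +0000 2019',
    'full_text': 'Example control tweet',
    'lang': 'en',
    'user': {'id': 123, 'screen_name': 'example_user', 'location': 'Nowhere',
             'friends_count': 10, 'followers_count': 25},
}
row = tweet_to_troll_row(sample_tweet)
print(row['author'], row['following'], row['followers'], row['retweet'])
```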

In [None]:
df_ats['lang'][1]

In [None]:
df_ats.user[10]

In [None]:
df_ats.entities[0]

In [None]:
print(df_ats['user'][1]['location'])

In [None]:
print(df_ats['user'][0]['id'])
print(df_ats['user'][0]['screen_name'])
print(df_ats['user'][0]['followers_count'])
print(df_ats['user'][0]['following'])

In [None]:
df_test=pd.DataFrame()

In [None]:
idx_row = df_ats.index[2]
curr_row = df_ats.loc[df_ats.index==idx_row]
curr_author = curr_row['user']
curr_author

In [None]:
df_columns_list =['external_author_id', 'author', 'content', 'region', 'following', 'followers', 'updates', 'post_type',
 'account_type', 'retweet', 'account_category']
df_export = pd.DataFrame(columns=df_columns_list)

In [None]:
df_export.loc[0,'external_author_id']='test'
df_export

In [None]:
curr_author[0]

In [None]:
full_text=curr_row['ret']
full_text

In [None]:
curr_row['user']
row

In [None]:
df_columns_list =['external_author_id', 'author', 'content', 'region', 'following', 'followers', 'updates', 'post_type',
 'account_type', 'retweet', 'account_category', 'publish_date', 'language']

# Build one record per extracted tweet, then create the dataframe in a single step
records = []
for row in df_ats.index:
    curr_author = df_ats.loc[row, 'user']  # nested dict describing the tweet's author
    records.append({
        'external_author_id': curr_author['id'],
        'author': curr_author['screen_name'],
        'content': df_ats.loc[row, 'full_text'],
        'region': curr_author['location'],
        # friends_count is the number of accounts the user follows;
        # the user object's 'following' field is not a count
        'following': curr_author['friends_count'],
        'followers': curr_author['followers_count'],
        'updates': np.nan,
        'post_type': 'control',
        'account_type': 'control',
        'retweet': df_ats.loc[row, 'retweeted'],
        'account_category': 'control',
        'publish_date': df_ats.loc[row, 'created_at'],
        'language': df_ats.loc[row, 'lang'],
    })

df_export = pd.DataFrame(records, index=df_ats.index, columns=df_columns_list)

In [None]:
df_export.head()

In [None]:
df_export.to_csv('newly_extracted_control_tweets.csv')