# 3 Million Russian Troll Tweets
- James M Irving, Ph.D.
- Mod 4 Project
- Flatiron Full Time Data Science Bootcamp - 02/2019 Cohort

## GOAL: 

- *If I can get a control dataset of non-troll tweets from the same time period with similar hashtags:*
    - Use NLP to predict if a tweet is from an authentic user or a Russian troll.
- *If there are no control tweets to compare to:*
    - Use NLP to predict how many retweets a Troll tweet will get.
    - Consider both raw # of retweets, as well as a normalized # of retweets/# of followers.
        - The latter would give better indication of language's effect on propagation. 
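
A minimal sketch of the normalization described above. The variable list below only mentions a binary `retweet` flag, so `retweet_count` here is a hypothetical per-tweet count, and the example values are made up for illustration.

```python
def normalized_retweets(retweet_count, followers):
    """Return retweets per follower, or None when the account has no followers."""
    if followers <= 0:
        return None  # ratio is undefined for accounts without followers
    return retweet_count / followers

# Hypothetical (retweet_count, followers) pairs, for illustration only
examples = [(50, 1000), (50, 100), (3, 0)]
for retweet_count, followers in examples:
    print(retweet_count, followers, normalized_retweets(retweet_count, followers))
```

The same raw count of 50 retweets gives a ratio ten times larger for the smaller account, which is the effect the normalized target is meant to capture.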
        

## Dataset Features:
- Kaggle Dataset published by FiveThirtyEight
    - https://www.kaggle.com/fivethirtyeight/russian-troll-tweets/downloads/russian-troll-tweets.zip/2
<br>    
- Data is split into 9 .csv files
    - `IRAhandle_tweets_1.csv` through `IRAhandle_tweets_9.csv`

- **Variables:**
    - ~~`external_author_id` | An author account ID from Twitter~~
    - `author` | The handle sending the tweet
    - `content` | The text of the tweet
    - `region` | A region classification, as [determined by Social Studio](https://help.salesforce.com/articleView?id=000199367&type=1)
    - `language` | The language of the tweet
    - `publish_date` | The date and time the tweet was sent
    - ~~`harvested_date` | The date and time the tweet was collected by Social Studio~~
    - `following` | The number of accounts the handle was following at the time of the tweet
    - `followers` | The number of followers the handle had at the time of the tweet
    - `updates` | The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
    - `post_type` | Indicates if the tweet was a retweet or a quote-tweet *[What's a quote-tweet?]*
    - `account_type` | Specific account theme, as coded by Linvill and Warren
    - `retweet` | A binary indicator of whether or not the tweet is a retweet [?]
    - `account_category` | General account theme, as coded by Linvill and Warren
    - `new_june_2018` | A binary indicator of whether the handle was newly listed in June 2018
    
### **Classification of account_type**
Taken from: [rcmediafreedom.eu summary](https://www.rcmediafreedom.eu/Publications/Academic-sources/Troll-Factories-The-Internet-Research-Agency-and-State-Sponsored-Agenda-Building)

>- **They identified five categories of IRA-associated Twitter accounts, each with unique patterns of behaviors:**
    - **Right Troll**, spreading nativist and right-leaning populist messages. It supported the candidacy and Presidency of Donald Trump and denigrated the Democratic Party. It often sent divisive messages about mainstream and moderate Republicans.
    - **Left Troll**, sending socially liberal messages and discussing gender, sexual, religious, and -especially- racial identity. Many tweets seemed intentionally divisive, attacking mainstream Democratic politicians, particularly Hillary Clinton, while supporting Bernie Sanders prior to the election.
    - **News Feed**, overwhelmingly presenting themselves as U.S. local news aggregators, linking to legitimate regional news sources and tweeting about issues of local interest.
    - **Hashtag Gamer**, dedicated almost exclusively to playing hashtag games.
    - **Fearmonger**: spreading a hoax about poisoned turkeys near the 2015 Thanksgiving holiday.

>The different types of account were used differently and their efforts were conducted systematically, with different allocation when faced with different political circumstances or shifting goals. E.g.: there was a spike of activity by right and left troll accounts before the publication of John Podesta's emails by WikiLeaks. According to the authors, this activity can be characterised as “industrialized political warfare”.

___

In [None]:
import bs_ds as bs
from bs_ds.imports import *

In [None]:
import os
root_dir = 'russian-troll-tweets/'
# os.listdir('russian-troll-tweets/')
filelist = [os.path.join(root_dir,file) for file in os.listdir(root_dir) if file.endswith('.csv')]
filelist

In [None]:
# Previewing dataset
df = pd.read_csv(filelist[0])
df.head(3)

## Merging full dataset

In [None]:
# Vertically concatenate all files
# (built from scratch, so the first file previewed above is not added twice)
df = pd.concat([pd.read_csv(file) for file in filelist], axis=0, ignore_index=True)
df.info()

In [None]:
df.head()

# SCRUBBING/EDA

In [None]:
from pandas_profiling import ProfileReport
ProfileReport(df)

In [None]:
df.info()

## Observations from Inspection / Pandas_Profiling ProfileReport

- **Language to analyze is in `content`:**
    - Actual tweet contents. 
 
- **Classification/Analysis Thoughts:**
    - **Variables should be considered in 2 ways:**
        - First, the tweet contents. 
            - Use NLP to engineer features to feed into deep learning.
                - Sentiment analysis, named-entity frequency/types, most-similar words. 
        - Second, the tweet metadata. 
        
### Thoughts on specific features:
- `language`
    - There are 56 unique languages. 
    - 2.4 million are English, 670 K are in Russian, etc.
    - Note: for metadata, analyzing if an account posts in more than 1 language may be a good predictor. 
- `followers`/`following`
    - **following** could be informative if the goal is to predict whether it's a troll tweet.
    - **followers** should be used (with retweets) if predicting retweets based on content. 

- **Questions:**
    - [ ] Why are so many post_types missing? (55%?)
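
A rough sketch of the multi-language idea noted under `language`. It uses plain `(author, language)` pairs standing in for the `author` and `language` columns; the handles and the helper name `multilingual_authors` are hypothetical.

```python
from collections import defaultdict

def multilingual_authors(rows):
    """Map each author to True if their tweets span more than one language."""
    languages_by_author = defaultdict(set)
    for author, language in rows:
        languages_by_author[author].add(language)
    return {author: len(langs) > 1 for author, langs in languages_by_author.items()}

# Hypothetical (author, language) pairs, for illustration only
rows = [("handle_a", "English"), ("handle_a", "Russian"), ("handle_b", "English")]
flags = multilingual_authors(rows)
print(flags)
```

A flag like this would have to be computed before the non-English rows are dropped, since after that step every remaining author appears to post in English only.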
    
### Scrubbing to Perform
- **Recast Columns:**
    - [ ] `publish_date` to datetime. 
- **Columns to Discard:**
    - [ ] `external_author_id` ( we have author handle)
    - [ ] `harvested_date` (we care about publish_date, if anything, time-wise)
    - [ ] `language`: remove all non-English tweets and drop column
    - [ ] `new_june_2018`

In [None]:
# Drop non-English rows (copy so later in-place drops act on an independent frame)
df = df.loc[df.language=='English'].copy()
# df.info()

In [None]:
# `language` is dropped later, after reloading the saved CSV
cols_to_drop = ['external_author_id','harvested_date','new_june_2018']

df.drop(columns=cols_to_drop, inplace=True)

df.info()

### Recasting Publish date as datetime column (date_published)

In [None]:
# Convert publish_date to datetime
df['date_published'] = pd.to_datetime(df.publish_date)
print(np.max(df.date_published), np.min(df.date_published))

In [None]:
df.set_index('date_published',inplace=True)

In [None]:
df.head()

___
# Save/Load and Resume

In [None]:
# Save csv
# df.to_csv('russian_troll_tweets_eng_only_date_pub_index.csv')

In [None]:
# bs.check_unique(df,['region'])

In [None]:
import bs_ds as bs
from bs_ds.imports import *
# Load csv
df = pd.read_csv('russian_troll_tweets_eng_only_date_pub_index.csv')

# Recast date_published as datetime and make index
df.date_published = pd.to_datetime(df['date_published'])
df.set_index('date_published', inplace=True)
print('Changed index to datetime "date_published".')

# Drop un-needed columns
cols_to_drop = ['publish_date','language']
for col in cols_to_drop:
    
    df.drop(col, axis=1, inplace=True)
    print(f'Dropped {col}.')
    
    
# Recast categorical columns
cols_to_cats = ['region','post_type','account_type','account_category']
for col in cols_to_cats:
    
    df[col] = df[col].astype('category')
    print(f'Converted {col} to category.')


# Drop problematic NaN in 'content'
df.dropna(subset=['content'],inplace=True) # Dropping the 1 null value 

df.head()

In [None]:
# bs.big_pandas()
# pd.set_option('display.max_info_columns',500)

In [None]:
df.info()

# Thoughts on My Search Strategy

**My Twitter API Link:**<br>
https://api.twitter.com/1.1/tweets/search/fullarchive/search.json


**Inspect Data to get search parameters:**
- [X] Get the date range for the English tweets in the original dataset<br>
    - **Tweet date range:**
        - **2012-02-06** to **2018-05-30**

- [X] Get a list of the hashtags (and their frequencies) from the dataframe

**Determine the most feasible and balanced way of extracting control tweets**
- [ ] How many of each tag / @ should I try to extract?
- [ ] What limitations of the API will be a roadblock to getting as many tweets as desired?
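
One possible way to answer the "how many of each" question is to split a fixed download budget in proportion to how often each term appears in the troll data. This is only a sketch: the counts, the budget, and the helper name `allocate_quota` are hypothetical.

```python
def allocate_quota(counts, total_budget):
    """Split total_budget across terms in proportion to their observed counts."""
    grand_total = sum(counts.values())
    # Floor division keeps quotas as whole tweets; a few may be left unallocated
    return {term: total_budget * n // grand_total for term, n in counts.items()}

# Hypothetical term counts and download budget, for illustration only
counts = {"#news": 600, "#sports": 300, "#politics": 100}
quota = allocate_quota(counts, total_budget=500)
print(quota)
```

Because each quota is rounded down, the quotas can sum to slightly less than the budget when the proportions do not divide evenly.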

In [None]:
# Inspect Data to get search parameters:
print(f'Tweet date range:\n {min(df.index)} to {max(df.index)}')
print(f'\nTotal days: {max(df.index)-min(df.index)}')

### Determining Hashtags & @'s

- Use regular expressions to extract the hashtags #words and @handles.
- Use the top X many tags as search terms for twitter API
    - There are _1,678,170 hashtags_ and _1,165,744 @'s_ in total (every occurrence is counted, not just distinct values)
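
A small, self-contained illustration of the regex approach on made-up tweets. It uses `\w+` so that a bare `#` is not counted as a tag, and it lowercases tags before counting because Twitter treats hashtags as case-insensitive; both choices are assumptions of this sketch rather than what the function in the next cell does.

```python
import re
from collections import Counter

# Made-up tweets, for illustration only
sample_tweets = [
    "Breaking: #news from @example_user #News",
    "Game night! #sports # not-a-tag @example_user",
]
text = " ".join(sample_tweets)

tags = re.findall(r"#\w+", text)   # '#' followed by at least one word character
ats = re.findall(r"@\w+", text)    # '@' followed by at least one word character

tag_counts = Counter(tag.lower() for tag in tags)
at_counts = Counter(ats)
print(tag_counts.most_common(), at_counts.most_common())
```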

In [None]:
# Define get_tags_ats to accept a list of text entries and return all found tags and ats as 2 series/lists
def get_tags_ats(text_to_search, exp_tag=r'(#\w*)', exp_at=r'(@\w*)', output='series', show_counts=False):
    """Accepts a list of text entries to search, a regex for tags, and a regex for @'s.
    Joins all entries in the list of text and then runs re.findall() for both expressions.
    Returns found_tags and found_ats as two Series (output='series') or as two lists otherwise."""
    import re

    as_series = output.lower() == 'series'
    if show_counts and not as_series:
        raise ValueError('output must be set to "series" in order to show_counts')

    # Create a single long string from all entries
    text_to_search_combined = ' '.join(text_to_search)

    found_tags = re.findall(exp_tag, text_to_search_combined)
    found_ats = re.findall(exp_at, text_to_search_combined)

    if as_series:
        found_tags = pd.Series(found_tags, name='tags')
        found_ats = pd.Series(found_ats, name='ats')

        if show_counts:
            # Report counts from the series found in this call
            print(f'\t{found_tags.name}:\n{found_tags.value_counts()} \n\n\t{found_ats.name}:\n{found_ats.value_counts()}')

    return found_tags, found_ats

In [None]:
# Need to get a list of hash tags.
# Pull every tweet's text into a plain list
text_to_search_list = df['content'].tolist()

text_to_search_list[:2]

In [None]:
# Get all tweet tags and @'s from text_to_search_list
tweet_tags, tweet_ats = get_tags_ats(text_to_search_list, show_counts=False)

# len() counts every occurrence; nunique() counts distinct values
print(f"There were {len(tweet_tags)} hashtags ({tweet_tags.nunique()} unique) and {len(tweet_ats)} @'s ({tweet_ats.nunique()} unique)\n")

# Create a dataframe with top_tags
# (rename keeps the count column called 'tags' regardless of pandas version)
df_top_tags = tweet_tags.value_counts().head(40).rename('tags').to_frame()
df_top_tags['% Total'] = (df_top_tags['tags']/len(tweet_tags)*100)

# Create a dataframe with top_ats
df_top_ats = tweet_ats.value_counts().head(40).rename('ats').to_frame()
df_top_ats['% Total'] = (df_top_ats['ats']/len(tweet_ats)*100)

# Display top tags and ats
# bs.display_side_by_side(df_top_tags,df_top_ats)

### Notes on Top Tags and Ats:


- The most common tags include some very generic categories that may not be helpful in extracting control tweets.
    - ~~Exclude: '#news','#sports','#politics','#world','#local','#TopNews','#health','#business','#tech',~~
    - On second thought, this is entirely appropriate, since these tags would be what appears in the wild.
    - Additionally, using a larger number of them (like 30) starts to provide more targeted hashtags.<br><br>
  
- **The most common @'s are much more revealing and helpful in narrowing the focus of the results.**

In [None]:
# Choose list of top tags to use in search
list_top_30_tags = df_top_tags.index[:30]
list_top_30_tags

In [None]:
# Choose list of top @'s to use in search
list_top_30_ats = df_top_ats.index[:30]
list_top_30_ats

## Using the Twitter Search API to Extract Control Tweets

- [x] Required API keys are saved in the main folder that contains this repo.
- [x] Check the [Premium account docs for search syntax](https://developer.twitter.com/en/docs/tweets/search/guides/premium-operators.html)
- [x] [Check this article for using Tweepy for most efficient twitter api extraction](https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)

**LINK TO PREMIUM SEARCH API GUIDE**<br>
https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search

### Available operators
- Premium search API supports rules with up to 1,024 characters. The Search Tweets APIs support the premium operators listed below. See our Premium operators guide for more details.

- The base URI for the premium search API is https://api.twitter.com/1.1/tweets/search/.

**Matching on Tweet contents:**
- keyword
- "quoted phrase"
- #
- @
- url:
- lang:
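
A sketch of how a rule could be assembled from the top tags and @'s while respecting the 1,024-character limit quoted above. The `OR` keyword and parentheses for grouping come from the premium operators guide rather than the short list here, and `build_rule` is a hypothetical helper name.

```python
MAX_RULE_LENGTH = 1024  # character limit quoted above for premium search rules

def build_rule(terms, lang=None):
    """Join terms with OR, optionally restrict by language, and check the length limit."""
    rule = "(" + " OR ".join(terms) + ")"
    if lang:
        rule += f" lang:{lang}"
    if len(rule) > MAX_RULE_LENGTH:
        raise ValueError(f"rule is {len(rule)} characters; limit is {MAX_RULE_LENGTH}")
    return rule

# Hypothetical search terms, for illustration only
rule = build_rule(["#news", "#sports", "@example_user"], lang="en")
print(rule)
```

Raising on an over-long rule makes it obvious when a list of 30 or more terms needs to be split across several queries.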


**API Post Search Methods**
- POST /search/:product/:label	
    - Retrieve Tweets matching the specified query.
- POST /search/:product/:label/counts	
    - Retrieve the number of Tweets matching the specified query.
- where:
    - `:product` indicates the search endpoint you are making requests to, either 30day or fullarchive.
    - `:label` is the (case-sensitive) label associated with your search developer environment, as displayed at https://developer.twitter.com/en/account/environments.
- For example, if using the 30-day endpoint and your dev environment has a label of 'dev' (short for development), the search URLs would be:
    - Data endpoint providing Tweets: https://api.twitter.com/1.1/tweets/search/30day/dev.json
    - Counts endpoint providing counts of Tweets: https://api.twitter.com/1.1/tweets/search/30day/dev/counts.json
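
The URL pattern above can be sketched as a small helper. The function name `search_endpoints` is hypothetical; the base URI, the two product names, and the `.json` / `/counts.json` suffixes are taken from the description above.

```python
BASE_URI = "https://api.twitter.com/1.1/tweets/search/"
VALID_PRODUCTS = ("30day", "fullarchive")

def search_endpoints(product, label):
    """Return the (data, counts) endpoint URLs for a product and environment label."""
    if product not in VALID_PRODUCTS:
        raise ValueError(f"product must be one of {VALID_PRODUCTS}")
    data_url = f"{BASE_URI}{product}/{label}.json"
    counts_url = f"{BASE_URI}{product}/{label}/counts.json"
    return data_url, counts_url

data_url, counts_url = search_endpoints("30day", "dev")
print(data_url)
print(counts_url)
```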

 

### Using tweepy to access twitter API

- [Helpful tutorial on _most efficient_ way to access twitter API](https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)
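
The core idea behind paging with `max_id` is to ask, on each request, only for tweets older than the oldest one already seen. The sketch below shows that loop against a stand-in search function over made-up integer IDs; `fake_search` and `collect_ids` are hypothetical and do not call the Twitter API.

```python
def fake_search(all_ids, count, max_id=None):
    """Stand-in for a search call: return up to `count` IDs <= max_id, newest first."""
    eligible = [i for i in sorted(all_ids, reverse=True) if max_id is None or i <= max_id]
    return eligible[:count]

def collect_ids(all_ids, max_tweets, per_query=100):
    """Page backwards through results until max_tweets are collected or none remain."""
    collected = []
    max_id = None
    while len(collected) < max_tweets:
        page = fake_search(all_ids, per_query, max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        max_id = page[-1] - 1  # next page starts just below the oldest ID seen
    return collected[:max_tweets]

ids = collect_ids(range(1, 251), max_tweets=230, per_query=100)
print(len(ids), ids[0], ids[-1])
```

Subtracting 1 from the oldest ID is what prevents the last tweet of one page from being returned again as the first tweet of the next.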

In [None]:
# Import API keys from text files (so not displayed here and not in repo)
# strip() removes any trailing newline saved with the key
with open('../consumer_API_key.txt','r') as f:
    api_key =  f.read().strip()
with open('../consumer_API_secret_key.txt','r') as f:
    api_secret_key  = f.read().strip()

In [None]:
# Initialize Tweepy with authorization keys
# (written against Tweepy 3.x; `wait_on_rate_limit_notify` was removed in Tweepy 4)
def connect_twitter_api(api_key, api_secret_key):
    import tweepy, sys
    auth = tweepy.AppAuthHandler(api_key, api_secret_key)
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    if (not api):
        print("Can't authenticate.")
        sys.exit(-1)
    return api

In [None]:
api = connect_twitter_api(api_key,api_secret_key)

In [None]:
df_top_ats.ats[:20]

In [None]:
# Figure out the # of each @ and each # that I want to query, then make a query_dict to feed into the cell below
query_ats = tuple(zip(df_top_ats.index, df_top_ats['ats']))
query_tags = tuple(zip(df_top_tags.index, df_top_tags['tags']))

# Calculate how many tweets are represented by the top 40 tags and top 40 @'s
sum_top_tweet_tags = df_top_tags['tags'].sum()
sum_top_tweet_ats = df_top_ats['ats'].sum()
print(f"Sum of top tags = {sum_top_tweet_tags}\nSum of top @'s = {sum_top_tweet_ats}")

In [None]:
def search_twitter_api(api_object, searchQuery, maxTweets, fName, tweetsPerQry=100, max_id=0, sinceId=None):
    """Take an authenticated tweepy api_object, a search query, max # of tweets to retrieve, and a destination filename.
    Uses api_object.search for the searchQuery until maxTweets is reached, saving harvested tweets to fName."""
    import tweepy, jsonpickle

    tweetCount = 0
    print(f'Downloading max {maxTweets} for {searchQuery}...')
    with open(fName, 'a+') as f:
        while tweetCount < maxTweets:

            # Build the request, adding the paging arguments only when they are set
            search_kws = dict(q=searchQuery, count=tweetsPerQry, tweet_mode='extended')
            if max_id > 0:
                search_kws['max_id'] = str(max_id-1)
            if sinceId:
                search_kws['since_id'] = sinceId

            try:
                # Use the api_object passed in rather than a global `api`
                new_tweets = api_object.search(**search_kws)

                if not new_tweets:
                    print('No more tweets found')
                    break

                # Append each tweet as one JSON document per line
                for tweet in new_tweets:
                    f.write(jsonpickle.encode(tweet._json, unpicklable=False)+'\n')

                tweetCount+=len(new_tweets)

                print("Downloaded {0} tweets".format(tweetCount))
                # Remember the oldest tweet ID so the next request pages further back
                max_id = new_tweets[-1].id

            except tweepy.TweepError as e:
                # Just exit if any error
                print("some error : " + str(e))
                break
    print ("Downloaded {0} tweets, Saved to {1}\n".format(tweetCount, fName))



In [None]:
# Extract tweets for top @'s, while matching the distribution of top @'s

filename = 'top_tweets_by_at.txt'

for q in query_ats[:3]:
    searchQuery = q[0]
    maxTweets = q[1]
    print(f'Query={searchQuery}, max={maxTweets}')
    search_twitter_api(api, searchQuery, maxTweets, fName=filename)
#     time.sleep(1)
    
# toc(t0)


___

In [None]:
df_tweets = pd.read_json('top_tweets_by_at_06022019.txt', lines=True)

In [None]:
df_tweets.head()

In [None]:
df_tweets['date_published'] = pd.to_datetime(df_tweets['created_at'])

In [None]:
df_tweets['date_published'].min(), df_tweets['date_published'].max()