# 3 Million Russian Troll Tweets
- James M Irving, Ph.D.
- Mod 4 Project
- Flatiron Full Time Data Science Bootcamp - 02/2019 Cohort

## GOAL: 

- *If I can get a control dataset of non-troll tweets from the same time period with similar hashtags:*
    - Use NLP to predict whether a tweet is from an authentic user or a Russian troll.
- *If there are no control tweets to compare to:*
    - Use NLP to predict how many retweets a Troll tweet will get.
    - Consider both raw # of retweets, as well as a normalized # of retweets/# of followers.
        - The latter would give better indication of language's effect on propagation. 
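
A minimal sketch of the normalized target described above. The variable list in this notebook does not include a per-tweet retweet count, so the `retweet_count` column below is hypothetical and only illustrates the arithmetic; accounts with zero followers are mapped to NaN rather than infinity.

```python
import numpy as np
import pandas as pd

def normalized_retweets(retweet_count: pd.Series, followers: pd.Series) -> pd.Series:
    """Retweets per follower; rows with 0 followers become NaN instead of inf."""
    safe_followers = followers.where(followers > 0, np.nan)
    return retweet_count / safe_followers

demo = pd.DataFrame({
    'retweet_count': [10, 0, 5],   # hypothetical column, not in the Kaggle files
    'followers': [100, 50, 0],
})
demo['retweets_per_follower'] = normalized_retweets(demo['retweet_count'], demo['followers'])
print(demo)
```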
        

## Dataset Features:
- Kaggle Dataset published by FiveThirtyEight
    - https://www.kaggle.com/fivethirtyeight/russian-troll-tweets/downloads/russian-troll-tweets.zip/2
<br>    
- Data is split into 9 .csv files
    - 'IRAhandle_tweets_1.csv' to 9

- **Variables:**
    - ~~`external_author_id` | An author account ID from Twitter~~
    - `author` | The handle sending the tweet
    - `content` | The text of the tweet
    - `region` | A region classification, as [determined by Social Studio](https://help.salesforce.com/articleView?id=000199367&type=1)
    - `language` | The language of the tweet
    - `publish_date` | The date and time the tweet was sent
    - ~~`harvested_date` | The date and time the tweet was collected by Social Studio~~
    - `following` | The number of accounts the handle was following at the time of the tweet
    - `followers` | The number of followers the handle had at the time of the tweet
    - `updates` | The number of “update actions” on the account that authored the tweet, including tweets, retweets and likes
    - `post_type` | Indicates if the tweet was a retweet or a quote-tweet *[What's a quote-tweet?]*
    - `account_type` | Specific account theme, as coded by Linvill and Warren
    - `retweet` | A binary indicator of whether or not the tweet is a retweet [?]
    - `account_category` | General account theme, as coded by Linvill and Warren
    - `new_june_2018` | A binary indicator of whether the handle was newly listed in June 2018
    
### **Classification of account_type**
Taken from: [rcmediafreedom.eu summary](https://www.rcmediafreedom.eu/Publications/Academic-sources/Troll-Factories-The-Internet-Research-Agency-and-State-Sponsored-Agenda-Building)

>- **They identified five categories of IRA-associated Twitter accounts, each with unique patterns of behaviors:**
    - **Right Troll**, spreading nativist and right-leaning populist messages. It supported the candidacy and Presidency of Donald Trump and denigrated the Democratic Party. It often sent divisive messages about mainstream and moderate Republicans.
    - **Left Troll**, sending socially liberal messages and discussing gender, sexual, religious, and -especially- racial identity. Many tweets seemed intentionally divisive, attacking mainstream Democratic politicians, particularly Hillary Clinton, while supporting Bernie Sanders prior to the election.
    - **News Feed**, overwhelmingly presenting themselves as U.S. local news aggregators, linking to legitimate regional news sources and tweeting about issues of local interest.
    - **Hashtag Gamer**, dedicated almost exclusively to playing hashtag games.
    - **Fearmonger**: spreading a hoax about poisoned turkeys near the 2015 Thanksgiving holiday.

>The different types of account were used differently and their efforts were conducted systematically, with different allocation when faced with different political circumstances or shifting goals. E.g.: there was a spike of activity by right and left troll accounts before the publication of John Podesta's emails by WikiLeaks. According to the authors, this activity can be characterised as “industrialized political warfare”.

___

In [1]:
import bs_ds as bs
from bs_ds.imports import *

Using TensorFlow backend.


View our documentation at https://bs-ds.readthedocs.io/en/latest/index.html
For convenient loading of standard modules :
>> from bs_ds.imports import *



| Package | Module/Package Handle |
|---|---|
| pandas | pd |
| numpy | np |
| matplotlib | mpl |
| matplotlib.pyplot | plt |
| seaborn | sns |


In [None]:
import os
root_dir = 'russian-troll-tweets/'
# os.listdir('russian-troll-tweets/')
filelist = [os.path.join(root_dir,file) for file in os.listdir(root_dir) if file.endswith('.csv')]
filelist

In [None]:
# Previewing dataset
df = pd.read_csv(filelist[0])
df.head(3)

## Merging full dataset

In [None]:
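# Vertically concatenate all files.
# `df` already holds filelist[0] from the preview cell, so rebuild it from
# scratch here; appending every file to it would load the first file twice.
df = pd.concat([pd.read_csv(file) for file in filelist], axis=0, ignore_index=True)
df.info()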

In [None]:
df.head()

# SCRUBBING/EDA

In [None]:
from pandas_profiling import ProfileReport
ProfileReport(df)

In [None]:
df.info()

## Observations from Inspection / Pandas_Profiling ProfileReport

- **The language to analyze is in `content`:**
    - The actual tweet contents. 
 
- **Classification/Analysis Thoughts:**
    - **Variables should be considered in 2 ways:**
        - First, the tweet contents. 
            - Use NLP to engineer features to feed into deep learning.
                - Sentiment analysis, named-entity frequency/types, most-similar words. 
        - Second, the tweet metadata. 
        
### Thoughts on specific features:
- `language`
    - There are 56 unique languages. 
    - 2.4 million are English, 670 K are in Russian, etc.
    - Note: for metadata, analyzing if an account posts in more than 1 language may be a good predictor. 
- `followers`/`following`
    - **following** could be informative if the goal is to predict whether it's a troll tweet.
    - **followers** should be used (with retweets) if predicting retweets based on content. 

- **Questions:**
    - [ ] Why are so many post_types missing? (55%?)
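
Two of the points above lend themselves to a quick check: flagging accounts that post in more than one language, and measuring how much of `post_type` is missing. The sketch below uses a tiny made-up frame with the dataset's column names; the helper names are illustrative, and on the real data the language check would need to run before non-English rows are dropped.

```python
import pandas as pd

def multilingual_authors(df: pd.DataFrame) -> pd.Series:
    """Boolean Series indexed by author: True if the account posted in >1 language."""
    return df.groupby('author')['language'].nunique() > 1

def missing_share(df: pd.DataFrame, column: str) -> float:
    """Fraction of rows where `column` is null."""
    return df[column].isna().mean()

demo = pd.DataFrame({
    'author': ['a', 'a', 'b', 'b'],
    'language': ['English', 'Russian', 'English', 'English'],
    'post_type': [None, 'RETWEET', None, None],
})
is_multi = multilingual_authors(demo)
print(is_multi)
print(f"post_type missing: {missing_share(demo, 'post_type'):.0%}")
```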
    
### Scrubbing to Perform
- **Recast Columns:**
    - [ ] `publish_date` to datetime. 
- **Columns to Discard:**
    - [ ] `external_author_id` (we have the author handle)
    - [ ] `harvested_date` (we care about `publish_date`, if anything, time-wise)
    - [ ] `language`: remove all non-English tweets and drop the column
    - [ ] `new_june_2018`

In [None]:
# Drop non-english rows
df = df.loc[df.language=='English']
# df.info()

In [None]:
cols_to_drop = ['external_author_id', 'harvested_date', 'new_june_2018']

# Drop all unneeded columns at once; reassigning avoids modifying a filtered view in place
df = df.drop(columns=cols_to_drop)

df.info()

### Recasting Publish date as datetime column (date_published)

In [None]:
# Convert publish_date to datetime
df['date_published'] = pd.to_datetime(df.publish_date)
print(np.max(df.date_published), np.min(df.date_published))

In [None]:
df.set_index('date_published',inplace=True)

In [None]:
df.head()

___
# Save/Load and Resume

In [None]:
# Save csv
# df.to_csv('russian_troll_tweets_eng_only_date_pub_index.csv')

In [None]:
# bs.check_unique(df,['region'])

In [1]:
import bs_ds as bs
from bs_ds.imports import *
# Load csv
df = pd.read_csv('russian_troll_tweets_eng_only_date_pub_index.csv')

# Recast date_published as datetime and make index
df.date_published = pd.to_datetime(df['date_published'])
df.set_index('date_published', inplace=True)
print('Changed index to datetime "date_published".')

# Drop un-needed columns
cols_to_drop = ['publish_date','language']
for col in cols_to_drop:
    
    df.drop(col, axis=1, inplace=True)
    print(f'Dropped {col}.')
    
    
# Recast categorical columns
cols_to_cats = ['region','post_type','account_type','account_category']
for col in cols_to_cats:
    
    df[col] = df[col].astype('category')
    print(f'Converted {col} to category.')


# Drop problematic nan in 'content'
df.dropna(subset=['content'],inplace=True) # Dropping the 1 null value 

df.head()

Changed index to datetime "date_published".
Dropped publish_date.
Dropped language.
Converted region to category.
Converted post_type to category.
Converted account_type to category.
Converted account_category to category.


| date_published | author | content | region | following | followers | updates | post_type | account_type | retweet | account_category |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017-10-01 19:58:00 | 10_GOP | "We have a sitting Democrat US Senator on tria... | Unknown | 1052 | 9636 | 253 | | Right | 0 | RightTroll |
| 2017-10-01 22:43:00 | 10_GOP | Marshawn Lynch arrives to game in anti-Trump s... | Unknown | 1054 | 9637 | 254 | | Right | 0 | RightTroll |
| 2017-10-01 22:50:00 | 10_GOP | Daughter of fallen Navy Sailor delivers powerf... | Unknown | 1054 | 9637 | 255 | RETWEET | Right | 1 | RightTroll |
| 2017-10-01 23:52:00 | 10_GOP | JUST IN: President Trump dedicates Presidents ... | Unknown | 1062 | 9642 | 256 | | Right | 0 | RightTroll |
| 2017-10-01 02:13:00 | 10_GOP | 19,000 RESPECTING our National Anthem! #StandF... | Unknown | 1050 | 9645 | 246 | RETWEET | Right | 1 | RightTroll |
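
The reload steps above (parse the date index, recast categoricals, drop unneeded columns, drop null `content`) could also be folded into the `read_csv` call itself. This is a sketch under the assumption that the saved file has the columns shown in this notebook; `load_saved_tweets` is an illustrative helper name, and the in-memory CSV stands in for the real file.

```python
import io
import pandas as pd

cat_cols = ['region', 'post_type', 'account_type', 'account_category']

def load_saved_tweets(path_or_buffer) -> pd.DataFrame:
    """Read the saved CSV with a datetime index and categorical columns in one pass."""
    df = pd.read_csv(
        path_or_buffer,
        index_col='date_published',
        parse_dates=['date_published'],
        dtype={col: 'category' for col in cat_cols},
    )
    # errors='ignore' keeps this working if the columns were already removed
    df = df.drop(columns=['publish_date', 'language'], errors='ignore')
    return df.dropna(subset=['content'])

csv_text = (
    "date_published,author,content,region,post_type,account_type,account_category\n"
    "2017-10-01 19:58:00,10_GOP,some text,Unknown,,Right,RightTroll\n"
    "2017-10-01 22:50:00,10_GOP,,Unknown,RETWEET,Right,RightTroll\n"
)
demo = load_saved_tweets(io.StringIO(csv_text))
print(demo.dtypes)
```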


In [2]:
# bs.big_pandas()
# pd.set_option('display.max_info_columns',500)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2420533 entries, 2017-10-01 19:58:00 to 2015-08-13 11:19:00
Data columns (total 10 columns):
author              object
content             object
region              category
following           int64
followers           int64
updates             int64
post_type           category
account_type        category
retweet             int64
account_category    category
dtypes: category(4), int64(4), object(2)
memory usage: 138.5+ MB


# Thoughts on My Search Strategy

**My Twitter API Link:**<br>
https://api.twitter.com/1.1/tweets/search/fullarchive/search.json


**Inspect Data to get search parameters:**
- [X] Get the date range for the English tweets in the original dataset<br>
    - **Tweet date range:**
        - **2012-02-06** to **2018-05-30**

- [X] Get a list of the hashtags (and their frequencies) from the dataframe

**Determine the most feasible and balanced way of extracting control tweets**
- [ ] How many of each tag / @'s should I try to extract?
- [ ] What are the limitations of the API that will be a roadblock to getting as many tweets as desired?

In [4]:
# Inspect Data to get search parameters:
print(f'Tweet date range:\n {min(df.index)} to {max(df.index)}')
print(f'\nTotal days: {max(df.index)-min(df.index)}')

Tweet date range:
 2012-02-06 20:24:00 to 2018-05-30 20:58:00

Total days: 2305 days 00:34:00


### Determining Hashtags & @'s

- Use regular expressions to extract the hashtags (#words) and @handles.
- Use the top X tags as search terms for the Twitter API.
    - There are _1,678,170 hashtag occurrences_ and _1,165,744 @ occurrences_ in total (counts of all matches, not of distinct values).
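
The extraction described above can be sketched per tweet with pandas string methods. One detail worth noting: a pattern such as `#\w*` also matches a lone `#` or `@` (zero word characters), which would explain a bare `@` showing up among the most frequent mentions; `\w+` requires at least one character. The example tweets below are made up.

```python
import pandas as pd

tweets = pd.Series([
    'Great game tonight #sports #news @CNN',
    'Email me @ noon #news',
    'No tags here',
])

# \w+ requires at least one word character, so a lone '#' or '@' is not counted
tags = tweets.str.findall(r'#\w+').explode().dropna()
ats = tweets.str.findall(r'@\w+').explode().dropna()

tag_counts = tags.value_counts()
at_counts = ats.value_counts()
print(tag_counts)
print(at_counts)
```

Keeping the matches aligned to each tweet (rather than joining all text into one string) also makes it possible to count distinct tags per tweet or per author later.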

In [5]:
# Define get_tags_ats to accept a list of text entries and return all found tags and ats as 2 series/lists
def get_tags_ats(text_to_search, exp_tag=r'(#\w*)', exp_at=r'(@\w*)', output='series', show_counts=False):
    """Accepts a list of text entries to search, a regex for tags, and a regex for @'s.
    Joins all entries in the list of text and then runs re.findall() for both expressions.
    Returns a series (or list) of found_tags and a series (or list) of found_ats."""
    import re

    # Value counts are only available for series output, so validate before doing any work
    if output.lower() != 'series' and show_counts:
        raise ValueError('output must be set to "series" in order to show_counts')

    # Create a single long string from the list of strings
    text_to_search_combined = ' '.join(text_to_search)

    found_tags = re.findall(exp_tag, text_to_search_combined)
    found_ats = re.findall(exp_at, text_to_search_combined)

    if output.lower() == 'series':
        found_tags = pd.Series(found_tags, name='tags')
        found_ats = pd.Series(found_ats, name='ats')

        if show_counts:
            # Use the local results (the original referenced the globals tweet_tags/tweet_ats)
            print(f'\t{found_tags.name}:\n{found_tags.value_counts()} \n\n\t{found_ats.name}:\n{found_ats.value_counts()}')

    return found_tags, found_ats

In [6]:
# Need to get a list of hash tags.
# Collect every tweet's text into a list (same result as looping over df['content'] row by row)
text_to_search_list = df['content'].tolist()

text_to_search_list[:2]

['"We have a sitting Democrat US Senator on trial for corruption and you\'ve barely heard a peep from the mainstream media." ~ @nedryun https://t.co/gh6g0D1oiC',
 'Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ']

In [8]:
# Get all tweet tags and @'s from text_to_search_list
tweet_tags, tweet_ats = get_tags_ats(text_to_search_list, show_counts=False)

# len() counts every match, so these are total occurrences, not distinct values
print(f"There were {len(tweet_tags)} total hashtags and {len(tweet_ats)} total @'s\n")

# Create a dataframe with top_tags
df_top_tags = pd.DataFrame(tweet_tags.value_counts()[:40])
df_top_tags['% Total'] = (df_top_tags['tags']/len(tweet_tags)*100)

# Create a dataframe with top_ats
df_top_ats = pd.DataFrame(tweet_ats.value_counts()[:40])
df_top_ats['% Total'] = (df_top_ats['ats']/len(tweet_ats)*100)

# Display top tags and ats
# bs.display_side_by_side(df_top_tags,df_top_ats)

There were 1678170 total hashtags and 1165744 total @'s



### Notes on Top Tags and Ats:


- The most common tags include some very generic categories that may not be helpful in extracting control tweets.
    - ~~Exclude: '#news','#sports','#politics','#world','#local','#TopNews','#health','#business','#tech',~~
    - On second thought, this is entirely appropriate, since these tags would be what appears in the wild.
    - Additionally, using a larger number of them (like 30) starts to provide more targeted hashtags.<br><br>
  
- **The most common @'s are much more revealing and helpful in narrowing the focus of the results.**

In [9]:
# Choose list of top tags to use in search
list_top_30_tags = df_top_tags.index[:30]
list_top_30_tags

Index(['#news', '#sports', '#politics', '#world', '#local', '#MAGA',
       '#BlackLivesMatter', '#TopNews', '#tcot', '#PJNET', '#health',
       '#business', '#tech', '#entertainment', '#top', '#Cleveland', '#crime',
       '#TopVideo', '#Trump', '#NowPlaying', '#amb', '#environment', '#ISIS',
       '#breaking', '#mar', '#WakeUpAmerica', '#Miami', '#2A', '#GOPDebate',
       '#topl'],
      dtype='object')

In [10]:
# Choose list of top tags to use in search
list_top_30_ats = df_top_ats.index[:30]
list_top_30_ats

Index(['@realDonaldTrump', '@midnight', '@POTUS', '@HillaryClinton',
       '@YouTube', '@', '@CNN', '@FoxNews', '@TalibKweli', '@WarfareWW',
       '@GiselleEvns', '@WorldOfHashtags', '@deray', '@nytimes', '@josephjett',
       '@CNNPolitics', '@GOP', '@seanhannity', '@BreitbartNews',
       '@BarackObama', '@HashtagRoundup', '@tedcruz', '@washingtonpost',
       '@docrocktex26', '@ShaunKing', '@BernieSanders', '@VanJones68',
       '@mashable', '@Jenn_Abrams', '@SpeakerRyan'],
      dtype='object')

## Using the Twitter Search API to Extract Control Tweets

- [x] Required API keys are saved in the main folder in which this repo is saved. 
- [x] Check the [Premium account docs for search syntax](https://developer.twitter.com/en/docs/tweets/search/guides/premium-operators.html)
- [x] [Check this article for using Tweepy for most efficient twitter api extraction](https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)

**LINK TO PREMIUM SEARCH API GUIDE**<br>
https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search

### Available operators
- Premium search API supports rules with up to 1,024 characters. The Search Tweets APIs support the premium operators listed below. See our Premium operators guide for more details.

- The base URI for the premium search API is https://api.twitter.com/1.1/tweets/search/.

**Matching on Tweet contents:**
- keyword
- "quoted phrase"
- #
- @
- url:
- lang:
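
As a rough illustration of combining these operators, the sketch below packs a list of hashtags and mentions into `OR` groups with a `lang:en` filter while staying under the 1,024-character rule limit quoted above. The helper `build_rules` and its greedy packing are assumptions made for illustration, not part of the Twitter documentation.

```python
MAX_RULE_LENGTH = 1024  # premium rule limit quoted above

def build_rules(terms, suffix='lang:en', max_length=MAX_RULE_LENGTH):
    """Greedily pack terms into '(a OR b OR ...) lang:en' rules under max_length."""
    rules, current = [], []
    for term in terms:
        candidate = current + [term]
        rule = f"({' OR '.join(candidate)}) {suffix}"
        if len(rule) > max_length and current:
            # Adding this term would exceed the limit: close the current rule
            rules.append(f"({' OR '.join(current)}) {suffix}")
            current = [term]
        else:
            current = candidate
    if current:
        rules.append(f"({' OR '.join(current)}) {suffix}")
    return rules

top_terms = ['#MAGA', '#BlackLivesMatter', '#tcot', '@realDonaldTrump', '@CNN']
for rule in build_rules(top_terms):
    print(len(rule), rule)
```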


**API Post Search Methods**
- POST /search/:product/:label	
    - Retrieve Tweets matching the specified query.
- POST /search/:product/:label/counts	
    - Retrieve the number of Tweets matching the specified query.
- where:
    - `:product` indicates the search endpoint you are making requests to, either 30day or fullarchive.
    - `:label` is the (case-sensitive) label associated with your search developer environment, as displayed at https://developer.twitter.com/en/account/environments.
- For example, if using the 30-day endpoint and your dev environment has a label of 'dev' (short for development), the search URLs would be:
    - Data endpoint providing Tweets: https://api.twitter.com/1.1/tweets/search/30day/dev.json
    - Counts endpoint providing counts of Tweets: https://api.twitter.com/1.1/tweets/search/30day/dev/counts.json
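
The URL pattern above is simple enough to express as a small helper. This is only a sketch of the string construction (the function name is illustrative); it makes no request and does not handle authentication.

```python
BASE_URI = 'https://api.twitter.com/1.1/tweets/search'

def premium_search_url(product: str, label: str, counts: bool = False) -> str:
    """Build the data or counts endpoint URL for a product and environment label."""
    if product not in ('30day', 'fullarchive'):
        raise ValueError("product must be '30day' or 'fullarchive'")
    endpoint = f'{BASE_URI}/{product}/{label}'
    return f'{endpoint}/counts.json' if counts else f'{endpoint}.json'

print(premium_search_url('30day', 'dev'))
print(premium_search_url('30day', 'dev', counts=True))
```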

 

### Using tweepy to access twitter API

- [Helpful tutorial on _most efficient_ way to access twitter API](https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)

In [11]:
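import os, sys, tweepy

# The API credentials are redacted in this notebook. One option is to read them from
# environment variables; the variable names below are placeholders, not from the original.
api_key = os.environ['TWITTER_API_KEY']
api_secret_key = os.environ['TWITTER_API_SECRET_KEY']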

auth = tweepy.AppAuthHandler(api_key, api_secret_key)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

if (not api):
    print("Can't authenticate.")
    sys.exit(-1)
# auth = tweepy.OAuthHandler(consumer_token, consumer_secret)

In [12]:
df_top_ats.ats

@realDonaldTrump    14999
@midnight            9991
@POTUS               5633
@HillaryClinton      5052
@YouTube             4484
@                    3758
@CNN                 3358
@FoxNews             3284
@TalibKweli          2064
@WarfareWW           1600
@GiselleEvns         1359
@WorldOfHashtags     1281
@deray               1205
@nytimes             1198
@josephjett          1164
@CNNPolitics         1163
@GOP                 1136
@seanhannity         1073
@BreitbartNews       1006
@BarackObama          995
@HashtagRoundup       947
@tedcruz              909
@washingtonpost       824
@docrocktex26         814
@ShaunKing            811
@BernieSanders        807
@VanJones68           802
@mashable             790
@Jenn_Abrams          756
@SpeakerRyan          746
@jstines3             728
@AC360                703
@CNNSitRoom           683
@KeshaTedder          681
@JakeTapper           677
@MariaSharapova       670
@TheLeadCNN           661
@WolfBlitzer          658
@BrianStelte

In [13]:
# Calculate how many tweets are represented by the top 30 tags and top 30 @'s 
sum_top_tweet_tags = df_top_tags['tags'].sum()
sum_top_tweet_ats = df_top_ats['ats'].sum()
print(f"Sum of top tags = {sum_top_tweet_tags}\nSum of top @'s = {sum_top_tweet_ats}")

Sum of top tags = 525668
Sum of top @'s = 80782
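
One way to approach the open question of how many tweets to request per term is to split a fixed download budget in proportion to each term's observed count. The sketch below uses three of the @ counts shown above; `allocate_quota` and the budget of 1,000 are illustrative assumptions.

```python
import pandas as pd

def allocate_quota(counts: pd.Series, total_budget: int) -> pd.Series:
    """Split total_budget across terms in proportion to their observed counts."""
    shares = counts / counts.sum()
    return (shares * total_budget).round().astype(int)

# A few of the @-mention counts shown above
observed = pd.Series({'@realDonaldTrump': 14999, '@midnight': 9991, '@POTUS': 5633})
quota = allocate_quota(observed, total_budget=1000)
print(quota)
```

Because each share is rounded independently, the quotas may not always add up to exactly the budget.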


In [14]:
# Figure out the # of each @ and each # that I want to query, then make a query_dict to feed into the cell below
query_ats = tuple(zip(df_top_ats.index, df_top_ats['ats']))
query_tags = tuple(zip(df_top_tags.index, df_top_tags['tags']))

In [15]:
query_ats
# searchQuery = '@realDonaldTrump'
# maxTweets = 500
# tweetsPerQry = 100 
# fName = 'extracted_tweets.txt'

# max_id = 0
# sinceId = None

(('@realDonaldTrump', 14999),
 ('@midnight', 9991),
 ('@POTUS', 5633),
 ('@HillaryClinton', 5052),
 ('@YouTube', 4484),
 ('@', 3758),
 ('@CNN', 3358),
 ('@FoxNews', 3284),
 ('@TalibKweli', 2064),
 ('@WarfareWW', 1600),
 ('@GiselleEvns', 1359),
 ('@WorldOfHashtags', 1281),
 ('@deray', 1205),
 ('@nytimes', 1198),
 ('@josephjett', 1164),
 ('@CNNPolitics', 1163),
 ('@GOP', 1136),
 ('@seanhannity', 1073),
 ('@BreitbartNews', 1006),
 ('@BarackObama', 995),
 ('@HashtagRoundup', 947),
 ('@tedcruz', 909),
 ('@washingtonpost', 824),
 ('@docrocktex26', 814),
 ('@ShaunKing', 811),
 ('@BernieSanders', 807),
 ('@VanJones68', 802),
 ('@mashable', 790),
 ('@Jenn_Abrams', 756),
 ('@SpeakerRyan', 746),
 ('@jstines3', 728),
 ('@AC360', 703),
 ('@CNNSitRoom', 683),
 ('@KeshaTedder', 681),
 ('@JakeTapper', 677),
 ('@MariaSharapova', 670),
 ('@TheLeadCNN', 661),
 ('@WolfBlitzer', 658),
 ('@BrianStelter', 657),
 ('@CNNI', 655))

In [16]:
def search_twitter_api(searchQuery, maxTweets, fName, tweetsPerQry=100, max_id=0, sinceId=None):
    """Pages backwards through search results for searchQuery and appends each tweet
    as one JSON object per line to fName. Relies on the global tweepy `api` object."""
    import jsonpickle

    tweetCount = 0
    print(f'Downloading max{maxTweets} for {searchQuery}...')
    with open(fName, 'a+') as f:
        while tweetCount < maxTweets:

            # Build the request; only pass max_id / since_id once they are set
            search_kwargs = dict(q=searchQuery, count=tweetsPerQry, tweet_mode='extended')
            if max_id > 0:
                search_kwargs['max_id'] = str(max_id - 1)
            if sinceId:
                search_kwargs['since_id'] = sinceId

            try:
                new_tweets = api.search(**search_kwargs)

                if not new_tweets:
                    print('No more tweets found')
                    break

                for tweet in new_tweets:
                    f.write(jsonpickle.encode(tweet._json, unpicklable=False)+'\n')

                tweetCount += len(new_tweets)

                print("Downloaded {0} tweets".format(tweetCount))
                # The oldest tweet on this page becomes the upper bound for the next page
                max_id = new_tweets[-1].id

            except tweepy.TweepError as e:
                # Just exit if any error
                print("some error : " + str(e))
                break
    print("Downloaded {0} tweets, Saved to {1}\n".format(tweetCount, fName))
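
The paging logic in the function above (request a page, then ask for tweets strictly older than the last one received) can be checked without network access. The sketch below replaces the API with a stand-in, `fake_search`, that serves integer ids newest-first; both function names are hypothetical.

```python
def fake_search(all_ids, count, max_id=None):
    """Stand-in for a search call: returns up to `count` ids <= max_id, newest first."""
    eligible = [i for i in sorted(all_ids, reverse=True) if max_id is None or i <= max_id]
    return eligible[:count]

def collect_ids(all_ids, max_items, per_page):
    """Page backwards through ids until max_items is reached or nothing is left."""
    collected, max_id = [], None
    while len(collected) < max_items:
        page = fake_search(all_ids, per_page, max_id=max_id)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        # Ask next for items strictly older than the oldest one seen so far
        max_id = page[-1] - 1
    return collected

ids = collect_ids(all_ids=range(1, 11), max_items=7, per_page=3)
print(ids)
```

As in the download loop, the total is only checked between pages, so the result can exceed `max_items` by up to one page minus one item; subtracting 1 from the oldest id is what prevents the same item from being returned twice.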
    


In [1]:
class Clock(object):
    """A clock meant to be used as a timer for functions using local time.
    Clock.tic() starts the timer, .lap() adds the current lap's time to clock._lap_times_list_, .toc() stops the timer.
    If the user initializes with verbose=0, only the start and final end times are displayed.
        If verbose=1, print each lap's info at the end of each lap.
        If verbose=2 (default), display the instruction line and return a dataframe of results.
    """
    from datetime import datetime
    from pytz import timezone
    from tzlocal import get_localzone
    from bs_ds import list2df

    def get_time(self,local=True):
        """Returns current time, in local time zone by default (local=True)."""
        from datetime import datetime
        from pytz import timezone
        from tzlocal import get_localzone

        _now_utc_=datetime.now(timezone('UTC'))
        _now_local_=_now_utc_.astimezone(self._timezone_)

        if local==True:
            return _now_local_
        else:
            return _now_utc_


    def __init__(self, verbose=2):

        from datetime import datetime
        from pytz import timezone
        from tzlocal import get_localzone

        self._strformat_ = []
        self._timezone_ = []
        self._timezone_ = get_localzone()
        self._start_time_ = []
        self._lap_label_ = []
        self._lap_end_time_ = []
        self._verbose_ = []
        self._lap_duration_ = []
        self._verbose_ = verbose
        self._prior_start_time_ = []

        strformat = "%m/%d/%y - %I:%M:%S %p"
        self._strformat_ = strformat

        if self._verbose_ > 0:
            print(f'Clock created at {self.get_time().strftime(strformat)}.')

        if self._verbose_>1:
            print(f'\tStart: clock.tic()\tMark lap: clock.lap()\tStop: clock.toc()\n')



    def mark_lap_list(self, label=None):
        """Used internally, appends the current laps' information when called by .lap()
        self._lap_times_list_ = [['Lap #' , 'Start Time','Start Label','Stop Time', 'Stop Label', 'Duration']]"""
        import bs_ds as bs
#         print(self._prior_start_time_, self._lap_end_time_)
        self._lap_times_list_.append([ self._lap_counter_ , # Lap #
                                      (self._prior_start_time_).strftime(self._strformat_), # This Lap's Start Time
                                      self._start_label_, # the start label for tic
                                      self._lap_end_time_,#.strftime(self._strformat_), # stop clock time
                                      self._lap_label_, # The Label passed with .lap()
                                      self._lap_duration_.total_seconds()]) # the lap duration


    def tic(self, label=None ):
        "Start the timer and display current time, appends label to the _list_lap_times."
        from datetime import datetime
        from pytz import timezone

        self._start_time_ = self.get_time()
        self._start_label_ = label
        self._lap_counter_ = 0
        self._prior_start_time_=self._start_time_
        self._lap_times_list_=[]

        # Initiate lap counter and list
        self._lap_times_list_ = [['Lap #','Start Time','Start Label','Stop Time', 'Stop Label', 'Duration']]
        self._lap_counter_ = 0
        print(f'Clock started at {self._start_time_.strftime(self._strformat_)}')

    def toc(self,label=None):
        """Stop the timer and displays results, appends label to final _list_lap_times entry"""
        from datetime import datetime
        from pytz import timezone
        from tzlocal import get_localzone
        from bs_ds import list2df


        _final_end_time_ = self.get_time()
        _total_time_ = _final_end_time_ - self._start_time_
        _end_label_ = label

        self._lap_counter_+=1
        self._final_end_time_ = _final_end_time_
        self._lap_label_=_end_label_
        self._lap_end_time_ = _final_end_time_.strftime(self._strformat_)
        self._lap_duration_ = _final_end_time_ - self._prior_start_time_
        self._total_time_ = _total_time_
        self.mark_lap_list()

        # Append Summary Line
        # str() keeps the width formatting from failing when no label was passed (label=None)
        print(f'\tLap #{self._lap_counter_} done @ {self._lap_end_time_}\tlabel: {str(self._lap_label_):>20}\tduration: {self._lap_duration_.total_seconds()} sec)')
        self._lap_times_list_.append(['Start-End',self._start_time_.strftime(self._strformat_), self._start_label_,self._final_end_time_.strftime(self._strformat_),'Total Time:', self._total_time_.total_seconds() ])

        df_lap_times = list2df(self._lap_times_list_,index_col='Lap #')
        print(f'Total Time: {_total_time_}.')
        if self._verbose_>1:
            return df_lap_times



    def lap(self, label=None):
        """Records time, duration, and label for current lap. Output display varies with clock verbose level.
        Calls .mark_lap_list() to document results in clock._lap_times_list_."""
        from datetime import datetime

        _end_time_ = self.get_time()

        # Append the lap attribute list and counter
        self._lap_label_ = label
        self._lap_end_time_ = _end_time_.strftime(self._strformat_)
        self._lap_counter_+=1
        self._lap_duration_ = (_end_time_ - self._prior_start_time_)
        # Now update the record
        self.mark_lap_list()

        # Now set next lap's new _prior_start
        self._prior_start_time_=_end_time_

        if self._verbose_>0:
            # str() keeps the width formatting from failing when no label was passed (label=None)
            print(f'\tLap #{self._lap_counter_} done @ {self._lap_end_time_}\tlabel: {str(self._lap_label_):>20}\tduration: {self._lap_duration_.total_seconds()} sec)')




In [2]:
clock=Clock(verbose=2)

Clock created at 06/02/19 - 02:52:07 PM.
	Start: clock.tic()	Mark lap: clock.lap()	Stop: clock.toc()



In [3]:
import time


clock.tic('starting the process')
time.sleep(0.5)

clock.lap('lap 1 completed')
time.sleep(1.2)

clock.lap('lap 2 completed')
time.sleep(0.5)

clock.lap('lap 3')
time.sleep(0.7)

clock.toc('final')

Clock started at 06/02/19 - 02:52:42 PM
	Lap #1 done @ 06/02/19 - 02:52:43 PM	label:      lap 1 completed	duration: 0.500465 sec)
	Lap #2 done @ 06/02/19 - 02:52:44 PM	label:      lap 2 completed	duration: 1.200662 sec)
	Lap #3 done @ 06/02/19 - 02:52:45 PM	label:                lap 3	duration: 0.500148 sec)
	Lap #4 done @ 06/02/19 - 02:52:45 PM	label:                final	duration: 0.702204 sec)
Total Time: 0:00:02.903479.


| Lap # | Start Time | Start Label | Stop Time | Stop Label | Duration |
|---|---|---|---|---|---|
| 1 | 06/02/19 - 02:52:42 PM | starting the process | 06/02/19 - 02:52:43 PM | lap 1 completed | 0.500465 |
| 2 | 06/02/19 - 02:52:43 PM | starting the process | 06/02/19 - 02:52:44 PM | lap 2 completed | 1.200662 |
| 3 | 06/02/19 - 02:52:44 PM | starting the process | 06/02/19 - 02:52:45 PM | lap 3 | 0.500148 |
| 4 | 06/02/19 - 02:52:45 PM | starting the process | 06/02/19 - 02:52:45 PM | final | 0.702204 |
| Start-End | 06/02/19 - 02:52:42 PM | starting the process | 06/02/19 - 02:52:45 PM | Total Time: | 2.903479 |
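
For cases where only elapsed durations are needed, a lighter-weight alternative to the `Clock` class is a context manager built on `time.perf_counter`. This is a sketch of a different technique, not a replacement for the class above; `timed` is an illustrative name and it records no wall-clock timestamps or time zones.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Append (label, elapsed_seconds) to `results` when the block exits."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results.append((label, time.perf_counter() - start))

laps = []
with timed('lap 1', laps):
    time.sleep(0.05)
with timed('lap 2', laps):
    time.sleep(0.02)

for label, seconds in laps:
    print(f'{label:>20}\t{seconds:.3f} sec')
```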


In [19]:
# Extract tweets for top @'s, while matching the distribution of top @'s

filename = 'top_tweets_by_at.txt'

for q in query_ats[:3]:
    searchQuery = q[0]
    maxTweets = q[1]
    print(f'Query={searchQuery}, max={maxTweets}')
    search_twitter_api(searchQuery, maxTweets, fName=filename)
#     time.sleep(1)
    
# toc(t0)


Query=@realDonaldTrump, max=14999
Downloading max14999 for @realDonaldTrump...
Downloaded 81 tweets
Downloaded 151 tweets
Downloaded 229 tweets
Downloaded 317 tweets
Downloaded 391 tweets
Downloaded 456 tweets
Downloaded 528 tweets
Downloaded 597 tweets
Downloaded 677 tweets
Downloaded 749 tweets
Downloaded 821 tweets
Downloaded 892 tweets
Downloaded 964 tweets
Downloaded 1043 tweets
Downloaded 1121 tweets
Downloaded 1205 tweets
Downloaded 1275 tweets
Downloaded 1346 tweets
Downloaded 1421 tweets
Downloaded 1497 tweets
Downloaded 1573 tweets
Downloaded 1658 tweets
Downloaded 1744 tweets
Downloaded 1818 tweets
Downloaded 1899 tweets
Downloaded 1978 tweets
Downloaded 2061 tweets
Downloaded 2138 tweets
Downloaded 2209 tweets
Downloaded 2285 tweets
Downloaded 2367 tweets
Downloaded 2444 tweets
Downloaded 2520 tweets
Downloaded 2609 tweets
Downloaded 2695 tweets
Downloaded 2769 tweets
Downloaded 2843 tweets
Downloaded 2925 tweets
Downloaded 3010 tweets
Downloaded 3084 tweets
Downloaded 3157

___

In [23]:
df_tweets = pd.read_json('top_tweets_by_at_06022019.txt', lines=True)

In [24]:
df_tweets.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user,withheld_in_countries
0,,,2019-06-02 15:32:39,"[0, 140]","{'hashtags': [], 'symbols': [], 'urls': [], 'u...",,0,False,RT @realDonaldTrump: When you are the “Piggy B...,,...,,,,24356,False,"{'contributors': None, 'coordinates': None, 'c...","<a href=""http://twitter.com/download/iphone"" r...",False,"{'contributors_enabled': False, 'created_at': ...",
1,,,2019-06-02 15:32:39,"[31, 81]","{'hashtags': [], 'symbols': [], 'urls': [{'dis...",,0,False,"@realDonaldTrump @CNN @nytimes Here it is, you...",,...,,,,0,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'contributors_enabled': False, 'created_at': ...",
2,,,2019-06-02 15:32:39,"[31, 79]","{'hashtags': [], 'symbols': [], 'urls': [], 'u...",,0,False,@realDonaldTrump @CNN @nytimes You LITERALLY s...,,...,,,,0,False,,"<a href=""http://twitter.com/#!/download/ipad"" ...",False,"{'contributors_enabled': False, 'created_at': ...",
3,,,2019-06-02 15:32:39,"[0, 140]","{'hashtags': [], 'symbols': [], 'urls': [], 'u...",,0,False,RT @realDonaldTrump: People have been saying f...,,...,,,,12978,False,"{'contributors': None, 'coordinates': None, 'c...","<a href=""http://twitter.com/download/iphone"" r...",False,"{'contributors_enabled': False, 'created_at': ...",
4,,,2019-06-02 15:32:39,"[47, 47]","{'hashtags': [], 'media': [{'display_url': 'pi...",{'media': [{'display_url': 'pic.twitter.com/H5...,0,False,@KevinJacksonTBS @realDonaldTrump @CNN @nytime...,,...,,,,0,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'contributors_enabled': False, 'created_at': ...",
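
Several columns in the preview above (`user`, `entities`, `retweeted_status`) hold nested dictionaries. One way to expand such a column into flat columns is `pd.json_normalize`, sketched here on a made-up two-row frame. This assumes pandas 1.0 or later, where the function is available at the top level, and the `user_` prefix is an illustrative choice.

```python
import pandas as pd

demo = pd.DataFrame({
    'full_text': ['first tweet', 'second tweet'],
    'user': [
        {'screen_name': 'alice', 'followers_count': 10},
        {'screen_name': 'bob', 'followers_count': 25},
    ],
})

# Expand each dict in `user` into its own columns, prefixed to avoid name clashes
user_cols = pd.json_normalize(demo['user'].tolist()).add_prefix('user_')
flat = pd.concat([demo.drop(columns='user').reset_index(drop=True), user_cols], axis=1)
print(flat)
```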


In [26]:
df_tweets['date_published'] = pd.to_datetime(df_tweets['created_at'])

In [28]:
df_tweets['date_published'].min(), df_tweets['date_published'].max()

(Timestamp('2019-05-23 18:12:37'), Timestamp('2019-06-02 16:30:29'))
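
The range above can be compared with the troll-tweet range reported earlier in this notebook (2012-02-06 to 2018-05-30). The sketch below checks whether the two ranges intersect; `ranges_overlap` is an illustrative helper. Since the goal was control tweets from the same time period, a result of no overlap suggests the `fullarchive` endpoint noted earlier would be needed rather than this search call.

```python
import pandas as pd

def ranges_overlap(start_a, end_a, start_b, end_b) -> bool:
    """True if the closed intervals [start_a, end_a] and [start_b, end_b] intersect."""
    return max(start_a, start_b) <= min(end_a, end_b)

# Troll-tweet range reported earlier in this notebook
troll_start, troll_end = pd.Timestamp('2012-02-06 20:24:00'), pd.Timestamp('2018-05-30 20:58:00')
# Range of the downloaded tweets shown above
control_start, control_end = pd.Timestamp('2019-05-23 18:12:37'), pd.Timestamp('2019-06-02 16:30:29')

overlap = ranges_overlap(troll_start, troll_end, control_start, control_end)
gap = control_start - troll_end
print(f'Overlap: {overlap}; gap between ranges: {gap.days} days')
```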