## Modify configuration file

All of your Twitter API tokens and keys, along with your Twitter screen name and password, are stored in a file called *scripts/config_{your_name}.py*.  We give you a template file called *scripts/config.py* in the repo.  Rename this file to *scripts/config_{your_name}.py*, as we do with homework files.  Then put your Chrome driver path, your Twitter username and password, and your Twitter API credentials in the *scripts/config_{your_name}.py* file.

You can find the API credentials for your Twitter API account here: https://developer.twitter.com/en/apps.  Click on *Details* for your app, and then *Keys and Tokens*. 

The Twitter API credentials are called

1. `APP_KEY`

2. `APP_SECRET`

3. `OAUTH_TOKEN`

4. `OAUTH_TOKEN_SECRET`

The Twitter login info is called

1. `USER`

2. `PASSWORD`

The Chrome driver path is called

1. `DRIVER_PATH`

I recommend setting `DRIVER_PATH` to something like `DRIVER_PATH = 'scripts/chromedriver_win32/chromedriver.exe'`.  Basically, create a folder in the *scripts* folder and put the driver .exe file in there.
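As a rough sketch, a filled-in configuration file could look like the following. The variable names are the ones listed above. Every value is a placeholder, and the layout of the real template in the repo may differ.

```python
# Sketch of scripts/config_{your_name}.py: all values below are placeholders

# Twitter API credentials (from the Keys and Tokens page of your app)
APP_KEY = "your-app-key"
APP_SECRET = "your-app-secret"
OAUTH_TOKEN = "your-oauth-token"
OAUTH_TOKEN_SECRET = "your-oauth-token-secret"

# Twitter login info used by the web crawler
USER = "your_screen_name"
PASSWORD = "your_password"

# Location of the Chrome driver executable
DRIVER_PATH = "scripts/chromedriver_win32/chromedriver.exe"
```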

## Install packages

We will need:
1. `twython` - this package lets you connect to the Twitter API. 

2. `selenium` - this package lets you crawl websites.

3. `chromedriver_autoinstaller` - this package can download and install a Chrome driver that matches your version of Chrome.

4. Chrome driver - this is software that lets us do the web crawling with `selenium`. You have to download a Chrome driver from https://chromedriver.chromium.org/downloads.  Check to make sure your driver version matches your version of Chrome.

In [None]:
!pip install twython --upgrade --user
!pip install selenium --upgrade --user
!pip install chromedriver_autoinstaller --upgrade --user

#You need to download your chrome driver too

## Import packages

We will import the packages we installed, along with some helper functions.

In [1]:
from twython import Twython
from datetime import datetime, timedelta
import numpy as np
import sqlite3, sys, os
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import codecs  #this lets us display tweets properly (emojis, etc.)

#helper code
import scripts.scraper_twitter_api as api


#### Import configuration file

Import your modified configuration file with the code `from scripts.config_{your_name} import *`

In [2]:
from scripts.config_lmdisch import *

# Collect Tweets by Keyword with the Twitter API

Next we will provide code to collect tweets that contain a keyword, or one of many in a set of keywords.

#### Connect to Twitter API

In [3]:
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
print("Connection made to Twitter API for "+twitter.verify_credentials()['screen_name'])


Connection made to Twitter API for lmdisch


#### Create list of query keywords

Create a list `keywords` that has all the words you want to search for.

In [4]:
keywords = ['boycottchina', 'CCPvirus', 'chinavirus', 'chinesevirus', 'chinaliedpeopledied', 
            'ChinaEnemyToTheWorld', 'kungflu', 'wuhanvirus', 'xivirus', 'beijingbiden', 'bidenvirus',
            'CovidHOAX', 'PLANdemic', 'Scamdemic', 'MedicalMartialLaw', 'stopaapihate', 'stopasianhate', 
            'stopasianhatecrimes', 'racism', 'asianstrong', 'asianpride', 'asianamerican', 'webelonghere', 
            'protectourelders', 'aapi', 'standforasians', 'wearenotavirus', 'indianvariant', 'southafricanvariant', 
            'britishvariant']

#### Collect tweets for each keyword

The tweets will be saved in a database file with name given by `fname`.  If you run this cell again with the same filename, new tweets will be added to the database.

In [13]:
fname = "data/final_project_tweets.db"
max_tweets = 1000
df = api.keyword_tweets(twitter, keywords, fname, max_tweets=max_tweets)

Tweets will be saved to database data/final_project_tweets.db
Querying keyword  boycottchina
Insterting final batch of tweets. got  1012  to insert
Querying keyword  CCPvirus
Insterting final batch of tweets. got  1055  to insert
Querying keyword  chinavirus
Insterting final batch of tweets. got  1017  to insert
Querying keyword  chinesevirus
Insterting final batch of tweets. got  1067  to insert
Querying keyword  chinaliedpeopledied
No more results
Insterting final batch of tweets. got  859  to insert
Querying keyword  ChinaEnemyToTheWorld
	Too many requests, go sleep for 15 minutes
	Will start insterting tweets in the meantime, got  0  to insert
No more results
Insterting final batch of tweets. got  284  to insert
Querying keyword  kungflu
No more results
Insterting final batch of tweets. got  152  to insert
Querying keyword  wuhanvirus
Insterting final batch of tweets. got  1038  to insert
Querying keyword  xivirus
No more results
Insterting final batch of tweets. got  6  to insert
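The helper's insert logic is not shown in this notebook. As an illustration only, here is one way to add new tweets to an existing SQLite table on a re-run without duplicating rows. The three-column schema, the `tweet_id` primary key, and the `insert_tweets` function are assumptions made for this sketch, not the helper's actual code.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")

# Assumed schema: tweet_id is the primary key, so it must be unique
conn.execute(
    "CREATE TABLE IF NOT EXISTS tweet "
    "(tweet_id TEXT PRIMARY KEY, screen_name TEXT, text TEXT)"
)

def insert_tweets(conn, rows):
    """Insert rows, skipping any whose tweet_id is already stored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO tweet (tweet_id, screen_name, text) "
            "VALUES (?, ?, ?)",
            rows,
        )

# Two runs whose results overlap on tweet_id "2"
first_run = [("1", "alice", "hello"), ("2", "bob", "world")]
second_run = [("2", "bob", "world"), ("3", "carol", "again")]

insert_tweets(conn, first_run)
insert_tweets(conn, second_run)

n_rows = conn.execute("SELECT COUNT(*) FROM tweet").fetchone()[0]
print(n_rows)  # 3: the overlapping tweet is stored only once
```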


#### Load database into a dataframe

We create a connection `conn` to the database and then load the data into a dataframe with the `read_sql_query` function.

In [10]:
conn = sqlite3.connect(fname)
df = pd.read_sql_query("SELECT * FROM tweet", conn)

print(f"df has {len(df)} rows")
print(f"df has columns {df.columns}")

df has 42704 rows
df has columns Index(['tweet_id', 'user_id', 'screen_name', 'created_at', 'text', 'geo_lat',
       'geo_long', 'place_type', 'place_name', 'lang', 'source',
       'retweet_count', 'favorite_count', 'retweet_status_id',
       'reply_to_status_id', 'reply_to_user_id', 'reply_to_screen_name'],
      dtype='object')


#### Look at top retweeted tweets

For fun, let's print out the top retweeted tweets.

In [11]:
ndisplay = 10
#sort by retweet count and keep only the top ndisplay rows
top = df.sort_values(by='retweet_count', ascending=False).head(ndisplay)
for index, row in top.iterrows():
    text = codecs.decode(row.text, 'unicode_escape')
    print(f"{row.retweet_count} retweets: @{row.screen_name}: {text}")

1014086 retweets: @cyycoby: RT @BTS_twt: #StopAsianHate
#StopAAPIHate https://t.co/mOmttkOpOt 
1014086 retweets: @Kpopermoon1: RT @BTS_twt: #StopAsianHate
#StopAAPIHate https://t.co/mOmttkOpOt 
1014086 retweets: @Arely_twt: RT @BTS_twt: #StopAsianHate
#StopAAPIHate https://t.co/mOmttkOpOt 
1014086 retweets: @Tho94375515: RT @BTS_twt: #StopAsianHate
#StopAAPIHate https://t.co/mOmttkOpOt 
1014086 retweets: @aseret1111: RT @BTS_twt: #StopAsianHate
#StopAAPIHate https://t.co/mOmttkOpOt 
1014086 retweets: @psyqchoo: RT @BTS_twt: #StopAsianHate
#StopAAPIHate https://t.co/mOmttkOpOt 
1014086 retweets: @_jiminsquishy: RT @BTS_twt: #StopAsianHate
#StopAAPIHate https://t.co/mOmttkOpOt 
1014086 retweets: @busanfriendboys: RT @BTS_twt: #StopAsianHate
#StopAAPIHate https://t.co/mOmttkOpOt 
1014086 retweets: @BANGTAN88631746: RT @BTS_twt: #StopAsianHate
#StopAAPIHate https://t.co/mOmttkOpOt 
1014086 retweets: @thu_tra_2k3: RT @BTS_twt: #StopAsianHate
#StopAAPIHate https://t.co/mOmttkOpOt 
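The list above shows the same tweet ten times because each retweet row carries the retweet count of the original tweet. One possible way to list distinct tweets is to drop rows with duplicate text before taking the top rows. The sketch below uses a small made-up dataframe `toy` with the same column names as the tweet table.

```python
import pandas as pd

# Toy data with hypothetical values: three retweets of one tweet, plus one original tweet
toy = pd.DataFrame({
    "screen_name": ["fan_a", "fan_b", "fan_c", "writer_d"],
    "text": ["RT @star: hello", "RT @star: hello", "RT @star: hello", "my own tweet"],
    "retweet_count": [900, 900, 900, 12],
})

ndisplay = 10
top = (
    toy.sort_values(by="retweet_count", ascending=False)
       .drop_duplicates(subset="text")  # keep one row per distinct tweet text
       .head(ndisplay)
)

for index, row in top.iterrows():
    print(f"{row.retweet_count} retweets: {row.text}")
```

On the real data, the same chain can be applied to `df` in place of `toy`.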


In [12]:
df.head()

Unnamed: 0,tweet_id,user_id,screen_name,created_at,text,geo_lat,geo_long,place_type,place_name,lang,source,retweet_count,favorite_count,retweet_status_id,reply_to_status_id,reply_to_user_id,reply_to_screen_name
0,1381700240030498816,1258672380689485827,BianchiDaniele8,2021-04-12 20:06:22,@lemondefr Quel chauvinisme de merde. Le const...,,,,,fr,"<a href=""http://twitter.com/download/android"" ...",0,0,,1.381608e+18,24744541.0,lemondefr
1,1381691539546640389,978319285838860288,tonydll8,2021-04-12 19:31:48,Boicottare i prodotti cinesi \xe8 giusto perch...,,,,,it,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,,,,
2,1381687827822444552,86651427,IRISH_BUILT905,2021-04-12 19:17:03,#ww2 #ww3 #BoycottChina history has ways of du...,,,,,en,"<a href=""http://twitter.com/download/iphone"" r...",0,0,,,,
3,1381680897288638466,1353806891000844293,ChuWestBaGa1,2021-04-12 18:49:30,@EpochTimes @EpochTimesHK #China_is_terrorist ...,,,,,en,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,1,,1.381671e+18,29097819.0,EpochTimes
4,1381677812747239427,1218157218253557760,Cindy01515332,2021-04-12 18:37:15,"RT @JilLye3: In Sagaing,\nYouths held #Boycott...",,,,,en,"<a href=""http://twitter.com/download/android"" ...",47,0,1.381454e+18,,,


# Collecting Following Networks with Web Crawlers

The `followers` and `following` modules contain functions to collect the followers and following of users using a web crawler.  We don't use the Twitter API because it is incredibly slow for collecting network data.

When building your networks, it is easier to use the `following` module.  This way you avoid getting stuck on someone with 100 million followers.


#### Modify the following and followers modules

The `following` and `followers` modules need to import your Twitter user name and password from the configuration file.  Since you will change the name of this file to *config_{your_name}.py*, you need to change the import line in *following.py* and *followers.py* from `from scripts.config import *` to  `from scripts.config_{your_name} import *`.  Then run the code `import scripts.following as Following` and `import scripts.followers as Followers`.

*ANNOYING FACT*: Each time you hard reset your repo, the *followers.py* and *following.py* files are overwritten with the version on the repo.  This means you have to change this import line each time you do a hard reset.  If you are clever, maybe you can rename the files and find a way to import them. 
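One possible workaround that leaves *following.py* and *followers.py* untouched is to register your personal configuration module under the name `scripts.config` before the helper modules are imported. Python checks `sys.modules` first when resolving an import, so a later `from scripts.config import *` would then pick up your settings. The sketch below shows the mechanism with a stand-in module and placeholder values. It has not been tested against the repo's modules.

```python
import importlib
import sys
import types

# Stand-in for your real configuration module (placeholder values).
# In the notebook you would instead write:
#   import scripts.config_your_name as my_config
my_config = types.ModuleType("scripts.config_your_name")
my_config.USER = "example_user"
my_config.PASSWORD = "example_password"

# Register the personal config under the name the helper modules import
sys.modules["scripts.config"] = my_config

# Any later import of scripts.config now resolves to my_config
resolved = importlib.import_module("scripts.config")
print(resolved.USER)  # example_user
```

In the notebook, the `sys.modules` line would have to run before `import scripts.following as Following` and `import scripts.followers as Followers`, and it would need to be re-run after each kernel restart.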

 

In [None]:
import scripts.following as Following
import scripts.followers as Followers
import pandas as pd
import networkx as nx

#### Collect the following of a list of users.

Create a list `screen_names` of all the screen names whose following you want to collect.  The function `Following.Network.multi_fetch` will collect the following for each screen name.  This data is returned as a dataframe `df`, whose columns are `screen_name` and `following`.

In [None]:
%%time
screen_names = ["JoeBiden","JanetYellen","POTUS","SecDef","KamalaHarris","DrBiden", "BarackObama"]
df = Following.Network.multi_fetch(users=screen_names,max_count = 500)

for index,row in df.iterrows():
    print(f"{row.screen_name}: {len(row.following)} following")

df.head()   

#### Create networkx object

This code creates a networkx object `G` from `df`.

In [None]:
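G = nx.DiGraph()
#set of collected users, built once for fast membership tests
users = set(df.screen_name)
for index,row in df.iterrows():
    u = row.screen_name
    G.add_node(u)
    for v in row.following:
        #only keep edges between collected users; edge v -> u means u follows v
        if v in users:
            G.add_edge(v,u)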
            
print(f"Network has {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")


#### Save your network

Save the networkx object `G` to a pickle file with name given by `fname` using the `write_gpickle` function.

In [None]:
fname = 'data/network_following_biden.pickle'
nx.write_gpickle(G,fname)

#### Load network and draw it

Just to make sure we did everything correctly, load the network using the `read_gpickle` function, and draw it.

In [None]:
G = nx.read_gpickle(fname)
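The `write_gpickle` and `read_gpickle` functions were removed in NetworkX 3.0, so the two cells above only run on earlier versions. On newer versions, the standard `pickle` module can do the same job. The sketch below uses a small stand-in graph `G_demo` and a temporary file path so it does not overwrite your data.

```python
import os
import pickle
import tempfile

import networkx as nx

# Small stand-in graph; in the notebook this would be G
G_demo = nx.DiGraph()
G_demo.add_edge("POTUS", "JoeBiden")

# Temporary path for the sketch; in the notebook this would be fname
demo_path = os.path.join(tempfile.mkdtemp(), "network_demo.pickle")

# Save the graph
with open(demo_path, "wb") as f:
    pickle.dump(G_demo, f, pickle.HIGHEST_PROTOCOL)

# Load it back
with open(demo_path, "rb") as f:
    G_loaded = pickle.load(f)

print(f"Network has {G_loaded.number_of_nodes()} nodes, {G_loaded.number_of_edges()} edges")
```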


In [None]:
def draw_network_pos(G,pos,title_str):
    node_size = 100
    node_color = "pink"
    width = 0.5
    edge_color = "white"
    bg_color = "black"

    #draw the network using the node positions in pos
    fig = plt.figure(figsize= (8,6))
    plt.subplot(1,1,1)
    nx.draw(G, width=width,pos=pos ,node_color=node_color,
            edge_color=edge_color,node_size=node_size,
            connectionstyle='arc3',with_labels=True,font_color = 'white')
    plt.title(title_str,color = "white")
    fig.set_facecolor(bg_color)
    plt.show()    
    return 1

pos = nx.kamada_kawai_layout(G)
draw_network_pos(G,pos,"Biden Following Network")
