# Social Network Analysis of the Missoula County Twitter Account

Note: this project is heavily influenced by Steve Hedden's article __[_"How to download and visualize your Twitter network"_](https://towardsdatascience.com/how-to-download-and-visualize-your-twitter-network-f009dbbf107b)__.
## Import dependencies and set up the API client

In [1]:
import tweepy
import pymongo
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from community import community_louvain

# Get API Keys
file_path = 'keys.txt'
with open(file_path, 'r') as file:
    # Strip the trailing newline from each line so the tokens are sent cleanly
    keys = [line.strip() for line in file]

# Set up tweepy client
bearer_token = keys[1]
access_token = keys[2]
access_token_secret = keys[3]

# Keep the default return type (tweepy.Response) so that `response.data` works below
api = tweepy.Client(bearer_token=bearer_token, wait_on_rate_limit=True)

## Primary Followers
### Data Acquisition
Using Tweepy's **Paginator** class, we can request all of a user's followers without manually re-issuing the call after every 1,000 results.

In [6]:
user_id = ["3178252764"]
follower_list = []
for user in user_id:
    followers = []
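    try:
        # Query the user from the loop rather than a hard-coded ID
        for response in tweepy.Paginator(api.get_users_followers, user, max_results=1000):
            # response.data is None when a page contains no users
            for follower in response.data or []:
                followers.append(follower.id)
    except tweepy.TweepyException as err:
        print(f"error while fetching followers of {user}: {err}")
        continue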
    follower_list.append(followers)
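
To make clearer what the Paginator is doing for us, here is a minimal sketch of token-based pagination in plain Python. The `fetch_page` function and the `FAKE_FOLLOWERS` data are hypothetical stand-ins, not part of Tweepy or the Twitter API; the sketch only assumes that each page comes back with a token pointing at the next page.

```python
from typing import Iterator, Optional

# Hypothetical data: 2,500 fake follower IDs, served 1,000 at a time
FAKE_FOLLOWERS = list(range(2500))


def fetch_page(token: Optional[int], max_results: int = 1000) -> dict:
    """Stand-in for one API call: return one page plus the token of the next page."""
    start = token or 0
    end = start + max_results
    # No next token once the last page has been served
    next_token = end if end < len(FAKE_FOLLOWERS) else None
    return {"data": FAKE_FOLLOWERS[start:end], "next_token": next_token}


def paginate(max_results: int = 1000) -> Iterator[dict]:
    """Keep requesting pages until no next token is returned."""
    token = None
    while True:
        page = fetch_page(token, max_results)
        yield page
        token = page["next_token"]
        if token is None:
            break


followers = [fid for page in paginate() for fid in page["data"]]
page_sizes = [len(page["data"]) for page in paginate()]
print(len(followers), page_sizes)
```

The loop body in the notebook plays the role of the list comprehension here: it only ever sees one page at a time, while the paginator takes care of requesting the next one.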

### Create DataFrame of the _primary followers_
After the DataFrame has been created, we export it to a `.csv` file for further analysis.


In [7]:
df = pd.DataFrame(columns=['source','target'])
df['source'] = follower_list[0]
df['target'] = 3178252764 # User ID of the Missoula County account

df.to_csv("edges.csv",index=False)

<img src="finalVis/pFollowers1.png" alt="primary-followers" title="Primary Followers">

## Secondary Followers
We reuse the code from the _primary followers_, but this time we iterate over every primary follower, i.e. the '**source**' column of the DataFrame. <br>
Once that is complete, we check whether all users have been queried, since the code may have skipped users or the loop may have stopped too soon, in which case we have to restart it.

At this stage we already drop users with 0 followers, as they do not contribute any connection that we could use in the analysis later on.
If a request returns **`None`**, we skip that user. This typically means the account is private, so we cannot obtain any data from it.

Here we are obtaining the user IDs of the accounts that follow the queried user.

In [None]:
# The primary followers are stored in the 'source' column ('target' only holds the Missoula County ID)
user_list = list(df['source'])
secondary_df = pd.DataFrame(columns=['source','target'])

for userID in user_list:
    print(userID)
    secondary_followers = []

    for response in tweepy.Paginator(api.get_users_followers, userID, max_results=1000):
        if response.data is None:
            break
        for follower in response.data:
            secondary_followers.append(follower.id)

    if len(secondary_followers) > 0:
        temp = pd.DataFrame({'source': userID, 'target': secondary_followers})
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        secondary_df = pd.concat([secondary_df, temp], ignore_index=True)
        # Save after every user so that progress survives an interruption
        secondary_df.to_csv("sFollowers.csv")
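
Because the loop can stop early and has to be restarted, it helps to skip users that are already in the saved file. The following is a minimal sketch of that idea in plain Python. The helper `remaining_users` and the sample rows are hypothetical and only illustrate the bookkeeping; in the notebook the rows would come from the saved `.csv`.

```python
def remaining_users(all_users, edge_rows):
    """Return the users that do not yet appear as a 'source' in the saved rows."""
    done = {row["source"] for row in edge_rows}
    return [user for user in all_users if user not in done]


# Illustrative checkpoint: users 11 and 22 were already queried
checkpoint_rows = [
    {"source": 11, "target": 101},
    {"source": 11, "target": 102},
    {"source": 22, "target": 103},
]

todo = remaining_users([11, 22, 33, 44], checkpoint_rows)
print(todo)
```

One caveat of this approach: users who were skipped because they had no followers or a private account never appear in the file, so they would be queried again on every restart.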

### Verify all users have been queried

Once we have obtained a complete list of _secondary followers_, we verify that every user has been queried and appears in the saved file.

No output means all users were queried.

In [9]:
sFollowers_df = pd.read_csv("sFollowers.csv")
sFollowers = set(sFollowers_df['source'])  # a set gives fast membership checks

# Check every user instead of stopping at the first match
for userID in user_list:
    if userID not in sFollowers:
        print(str(userID) + " is not in the list")

### Data Cleaning and Preliminary Analysis

After the data has been obtained, checked, and verified, we need to reduce it, as there are too many data points for Gephi to handle.

Here we use NetworkX for the network analysis portion. Once we have converted the DataFrame into a graph, `G.number_of_nodes()` returns **1,280,569**, which is far too many nodes to analyze effectively. We therefore use `k_core()` to pare the graph down to its 10-core: the largest subgraph in which every node has `degree >= 10` within that subgraph, which is roughly the top 1% of users. `G_tmp.number_of_nodes()` now returns **12,791**, a much more manageable number.

In [5]:
sFollowers_df = pd.read_csv("sFollowers/sFollowers.csv")
G = nx.from_pandas_edgelist(sFollowers_df, 'source', 'target')

In [7]:
G_tmp = nx.k_core(G, 10)
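
A k-core is not the same as a plain degree filter: nodes are removed repeatedly until every remaining node has at least `k` neighbours that are themselves still in the graph. The sketch below illustrates this peeling idea in plain Python on a made-up toy graph. It is not the NetworkX implementation, and the function name `k_core_nodes` is hypothetical.

```python
from collections import defaultdict


def k_core_nodes(edges, k):
    """Return the node set of the k-core by repeatedly peeling low-degree nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops in this sketch
            adj[u].add(v)
            adj[v].add(u)

    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            # Count only neighbours that have not been removed yet
            if len(adj[node] & alive) < k:
                alive.discard(node)
                changed = True
    return alive


# Toy graph: a triangle (a, b, c) with a chain a - d - e hanging off it
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d"), ("d", "e")]

degree = defaultdict(int)
for u, v in toy_edges:
    degree[u] += 1
    degree[v] += 1

plain_filter = {node for node, deg in degree.items() if deg >= 2}
core = k_core_nodes(toy_edges, 2)
print(sorted(plain_filter), sorted(core))
```

Node `d` starts with degree 2 and would survive a plain degree filter, but once `e` is peeled off it drops to degree 1 and is removed as well, so only the triangle remains in the 2-core.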

### Community Detection
Now that we have a dataset we can work with, we run a community detection algorithm (Louvain, via `community_louvain.best_partition`) to split the data into groups. We then turn the partition into a DataFrame with the columns '`names`' and '`group`'.

In [9]:
partition = community_louvain.best_partition(G_tmp)

partition1 = pd.DataFrame([partition]).T
partition1 = partition1.reset_index()
partition1.columns = ['names','group']
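
The partition returned by `best_partition` is a dictionary that maps each node to a community number. As a rough sketch of how to inspect such a result before moving on, the snippet below counts community sizes with the standard library. The `toy_partition` values are made up for illustration and are not taken from the Missoula County data.

```python
from collections import Counter

# Made-up partition in the same shape: node -> community number
toy_partition = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "f": 2}

# Number of nodes per community
group_sizes = Counter(toy_partition.values())

# The largest community and its size
largest = group_sizes.most_common(1)

# The same information as (names, group) rows, like the DataFrame above
rows = sorted(toy_partition.items())

print(dict(group_sizes), largest)
```

A quick look at the size distribution shows whether the algorithm produced a few large communities or many tiny ones, which matters when colouring the groups later in Gephi.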

### Degree Centrality
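Now we sort the nodes by degree (the number of connections of each node, i.e. the unnormalized _degree centrality_), allowing us to see the most connected, and in that sense most influential, nodes of the graph.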

In [10]:
G_sorted = pd.DataFrame(sorted(G_tmp.degree, key=lambda x: x[1], reverse=True))
G_sorted.columns = ['names','degree']
dc = G_sorted
G_sorted.head()

| | names | degree |
|---|---|---|
| 0 | 16249481 | 7516 |
| 1 | 69365437 | 6296 |
| 2 | 17139587 | 5961 |
| 3 | 19188212 | 4374 |
| 4 | 21316568 | 4195 |
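
The table above lists raw degrees. Degree centrality, as NetworkX defines it, divides each degree by `n - 1`, where `n` is the number of nodes, so the ranking is the same and only the scale changes. The following sketch computes both values in plain Python on a made-up edge list. The function name `degree_table` is hypothetical, and the sketch assumes a simple undirected graph without self-loops.

```python
def degree_table(edges):
    """Return (node, degree, degree centrality) rows sorted by degree, highest first."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    n = len(neighbors)
    rows = [(node, len(nbrs), len(nbrs) / (n - 1)) for node, nbrs in neighbors.items()]
    # Python's sort is stable, so nodes with equal degree keep their insertion order
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy graph: hub "h" connected to a, b and c, plus one edge between a and b
toy_edges = [("h", "a"), ("h", "b"), ("h", "c"), ("a", "b")]
table = degree_table(toy_edges)

for node, deg, centrality in table:
    print(node, deg, round(centrality, 3))
```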


### Readability
For easier readability of the visualization, we need the Twitter handles of the users. For the sake of clarity we only look up the first **100** handles, which correspond to the **100** highest-degree nodes.

In [None]:
user_list = list(G_sorted["names"])
top_ids = user_list[:100]  # G_sorted is ordered by degree, so these are the top 100 nodes
sFollowersHandles = []

for user in top_ids:
    response = api.get_user(id=user)
    sFollowersHandles.append(response.data.username)
    print(str(user) + ' ' + sFollowersHandles[-1])

# One row per user: the ID and the matching handle
sFollowersHandles_df = pd.DataFrame({'Id': top_ids, 'handle': sFollowersHandles})
sFollowersHandles_df.to_csv("sFollowers/sFollowersHandles.csv")

### Export
After the preliminary analysis and data cleaning, we now need to export the cleaned data for further analysis in Gephi. To do that, we combine the sorted DataFrame with the groups from the community detection algorithm, giving us a DataFrame of nodes. We then convert the `G_tmp` NetworkX graph back into a pandas edge-list DataFrame and export both DataFrames as `.csv` files.

In [10]:
combined = pd.merge(dc, partition1, how='left', left_on='names', right_on='names')
combined = combined.rename(columns={'names':'id'})

edges = nx.to_pandas_edgelist(G_tmp)
nodes = combined['id']

edges.to_csv("edges.csv")
combined.to_csv("nodes.csv")
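
Since the merge is a left join, any node without a community would end up with an empty `group` value in the nodes file. A small sanity check along the following lines can catch that before exporting. The two DataFrames here are made-up miniature versions of `dc` and `partition1`, and the `indicator=True` option is one possible way to do the check, not something the original notebook uses.

```python
import pandas as pd

# Made-up miniature versions of the degree table and the partition table
toy_dc = pd.DataFrame({"names": [1, 2, 3], "degree": [5, 3, 1]})
toy_partition = pd.DataFrame({"names": [1, 2], "group": [0, 1]})

# indicator=True adds a '_merge' column that says where each row came from
checked = pd.merge(toy_dc, toy_partition, how="left", on="names", indicator=True)

# Nodes that exist in the degree table but received no community
missing = checked.loc[checked["_merge"] == "left_only", "names"].tolist()
print(missing)
```

In the notebook both tables are built from the same graph `G_tmp`, so the list of missing nodes would be expected to be empty.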

<img src="finalVis/sFollowers.png" alt="data-viz" title="Viz" />

## Secondary Following
We reuse the code from the _primary followers_, but this time we iterate over every primary follower, i.e. the '**source**' column of the DataFrame. <br>
Once that is complete, we check whether all users have been queried, since the code may have skipped users or the loop may have stopped too soon, in which case we have to restart it.

At this stage we already drop users who follow nobody, as they do not contribute any connection that we could use in the analysis later on.
If a request returns **`None`**, we skip that user. This typically means the account is private, so we cannot obtain any data from it.

This is the same procedure as for the _secondary followers_, but this time we obtain the user IDs of the accounts that the queried user is following.

In [None]:
# The primary followers are stored in the 'source' column ('target' only holds the Missoula County ID)
user_list = list(df['source'])
secondary_df = pd.DataFrame(columns=['source','target'])

for userID in user_list:
    print(userID)
    secondary_following = []

    for response in tweepy.Paginator(api.get_users_following, userID, max_results=1000):
        if response.data is None:
            break
        for following in response.data:
            secondary_following.append(following.id)

    if len(secondary_following) > 0:
        temp = pd.DataFrame({'source': userID, 'target': secondary_following})
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        secondary_df = pd.concat([secondary_df, temp], ignore_index=True)
        # Save after every user so that progress survives an interruption
        secondary_df.to_csv("sFollowing.csv")
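
Both collection loops put the queried user into the '`source`' column, and the graphs in this notebook are built as undirected graphs, so edge direction does not affect the results here. If a directed graph were wanted instead, the two relationships would need opposite directions. The sketch below shows one possible convention, where an edge always points from the follower to the followed account; the helper names are hypothetical.

```python
def follower_edges(user_id, follower_ids):
    """Each follower follows user_id, so edges point follower -> user_id."""
    return [(follower, user_id) for follower in follower_ids]


def following_edges(user_id, following_ids):
    """user_id follows each account, so edges point user_id -> account."""
    return [(user_id, account) for account in following_ids]


# Made-up IDs for illustration
edges_in = follower_edges(1, [10, 11])
edges_out = following_edges(1, [20, 21])
print(edges_in, edges_out)
```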

### Verify all users have been queried

Once we have obtained a complete list of _secondary following_, we verify that every user has been queried and appears in the saved file.

No output means all users were queried.

In [6]:
sFollowing_df = pd.read_csv("sFollowing/sFollowing.csv")
sFollowing = set(sFollowing_df['source'])  # check against the file we just loaded

# Check every user instead of stopping at the first match
for userID in user_list:
    if userID not in sFollowing:
        print(str(userID) + " is not in the list")

### Data Cleaning and Preliminary Analysis

After the data has been obtained, checked, and verified, we need to reduce it, as there are too many data points for Gephi to handle.

Here we use NetworkX for the network analysis portion. Once we have converted the DataFrame into a graph, `G.number_of_nodes()` returns **988,551**, which is far too many nodes to analyze effectively. We therefore use `k_core()` to pare the graph down to its 15-core: the largest subgraph in which every node has `degree >= 15` within that subgraph. `G_tmp.number_of_nodes()` now returns **14,778**, a much more manageable number.

In [35]:
sFollowing_df = pd.read_csv("sFollowing/sFollowing.csv")
G = nx.from_pandas_edgelist(sFollowing_df, 'source', 'target')

In [36]:
G_tmp = nx.k_core(G, 15)

### Community Detection
Now that we have a dataset we can work with, we run a community detection algorithm (Louvain, via `community_louvain.best_partition`) to split the data into groups. We then turn the partition into a DataFrame with the columns '`names`' and '`group`'.

In [37]:
partition = community_louvain.best_partition(G_tmp)

partition1 = pd.DataFrame([partition]).T
partition1 = partition1.reset_index()
partition1.columns = ['names','group']

### Degree Centrality
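Now we sort the nodes by degree (the number of connections of each node, i.e. the unnormalized _degree centrality_), allowing us to see the most connected, and in that sense most influential, nodes of the graph.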

In [38]:
G_sorted = pd.DataFrame(sorted(G_tmp.degree, key=lambda x: x[1], reverse=True))
G_sorted.columns = ['names','degree']
dc = G_sorted
G_sorted.head()

| | names | degree |
|---|---|---|
| 0 | 36167088 | 3235 |
| 1 | 17037258 | 2674 |
| 2 | 707475104 | 2642 |
| 3 | 22812156 | 2607 |
| 4 | 525005120 | 2552 |


### Readability
For easier readability of the visualization, we need the Twitter handles of the users. For the sake of clarity we only look up the first **100** handles, which correspond to the **100** highest-degree nodes.

In [None]:
user_list = list(G_sorted["names"])
top_ids = user_list[:100]  # G_sorted is ordered by degree, so these are the top 100 nodes
sFollowingHandles = []

for user in top_ids:
    response = api.get_user(id=user)
    sFollowingHandles.append("@" + response.data.username)
    print(str(user) + ' ' + sFollowingHandles[-1])

# One row per user: the ID and the matching handle
sFollowingHandles_df = pd.DataFrame({'Id': top_ids, 'handle': sFollowingHandles})
sFollowingHandles_df.to_csv("sFollowing/sFollowingHandles.csv")

### Export
After the preliminary analysis and data cleaning, we now need to export the cleaned data for further analysis in Gephi. To do that, we combine the sorted DataFrame with the groups from the community detection algorithm, giving us a DataFrame of nodes. We then convert the `G_tmp` NetworkX graph back into a pandas edge-list DataFrame and export both DataFrames as `.csv` files.

In [11]:
combined = pd.merge(dc, partition1, how='left', left_on='names', right_on='names')
combined = combined.rename(columns={'names':'id'})

edges = nx.to_pandas_edgelist(G_tmp)

edges.to_csv("sFollowing/edges.csv")
combined.to_csv("sFollowing/nodes.csv")

<img src="finalVis/sFollowing0.png" alt="data-viz" title="Viz" />

## Specific Groups

From the visualization of the _secondary followers_ we find that two groups stand out: the green and the pink communities. A cursory overview of their most influential nodes reveals a theme surrounding police and the recently cancelled TV series _Live PD_.

> So who are these accounts following?<br>
> What is their sentiment?<br>
> Are there any notable accounts that should be on watch?

In [None]:
cop_df = pd.read_csv("copTwitter/copTwitterNodes.csv")
cop_df = cop_df.drop(columns=['Label','timeset','degree','group','componentnumber','strongcompnum','indegree','outdegree','Eccentricity','closnesscentrality','harmonicclosnesscentrality','betweenesscentrality'])

sFollowers_df = pd.read_csv("sFollowers/sFollowers.csv")
sFollowers_df = sFollowers_df.drop(columns=['Unnamed: 0'])

In [None]:
user_list = list(cop_df['Id'])

# Keep only the edges whose source is one of the accounts in the selected communities
tertiary_df = sFollowers_df[sFollowers_df['source'].isin(user_list)].reset_index(drop=True)
tertiary_df.to_csv("copTwitter/tcFollowers.csv")

In [49]:
G = nx.from_pandas_edgelist(tertiary_df,'source','target')

partition = community_louvain.best_partition(G)

partition1 = pd.DataFrame([partition]).T
partition1 = partition1.reset_index()
partition1.columns = ['names','group']

In [None]:
G_sorted = pd.DataFrame(sorted(G.degree, key=lambda x: x[1], reverse=True))
G_sorted.columns = ['names', 'degree']
dc = G_sorted
G_sorted.head()

In [None]:
user_list = list(G_sorted['names'])
top_ids = user_list[:100]  # G_sorted is ordered by degree, so these are the top 100 nodes
tFollowerHandles = []

for user in top_ids:
    response = api.get_user(id=user)
    tFollowerHandles.append("@" + response.data.username)
    print(str(user) + ' ' + tFollowerHandles[-1])

# One row per user: the ID and the matching handle
tFollowerHandles_df = pd.DataFrame({'names': top_ids, 'Label': tFollowerHandles})
tFollowerHandles_df.to_csv("copTwitter/userHandles.csv")

In [None]:
combined = pd.merge(dc, partition1, how='left', left_on='names', right_on='names')
combined = pd.merge(combined, tFollowerHandles_df, how='left', left_on='names', right_on='names')
# Only 'names' needs renaming; the merge produces no 'Label_y' because 'Label' exists in one table only
combined = combined.rename(columns={'names': 'Id'})

edges = nx.to_pandas_edgelist(G)

edges.to_csv("copTwitter/edges.csv")
combined.to_csv("copTwitter/nodes.csv")

### Preliminary Results
#### Notable Accounts
**Group 0**<br>
__[`@sgilks2166`](https://twitter.com/sgilks2166)__<br>
__[`@CelataKaren`](https://twitter.com/CelataKaren)__<br>
__[`@PermdogsSandy`](https://twitter.com/PermdogsSandy)__<br>
__[`@MommyFayeee`](https://twitter.com/MommyFayeee)__<br>
__[`@HallfordJeannie`](https://twitter.com/HallfordJeannie)__<br>
__[`@TannamiaHall`](https://twitter.com/TannamiaHall)__<br>
__[`@Jeffok16`](https://twitter.com/Jeffok16)__<br>
__[`@Michael_KE7MT`](https://twitter.com/Michael_KE7MT)__<br>
__[`@Mickey19741`](https://twitter.com/Mickey19741)__<br>
**Group 28**<br>
__[`@BIGRED476`](https://twitter.com/BIGRED476)__<br>
__[`@Matt33822937`](https://twitter.com/Matt33822937)__<br>
**Group 8**<br>
__[`@resa2330`](https://twitter.com/resa2330)__<br>
__[`@rebelbrat71`](https://twitter.com/rebelbrat71)__<br>
**Group 10**<br>
__[`@OedekovenTerry`](https://twitter.com/OedekovenTerry)__<br>
**Group 23**<br>
__[`@johnsons_nc`](https://twitter.com/johnsons_nc)__<br>
**Group 15**<br>
__[`@TXCoffeeSlinger`](https://twitter.com/TXCoffeeSlinger)__<br>
**Group 21**<br>
__[`@tambarry`](https://twitter.com/tambarry)__<br>

<img src="finalVis/tcFollowers0.png" alt="cop-data-viz" title="Cop Twitter Data Visualization">

In [None]:
notable_accounts = ['sgilks2166','CelataKaren','PermdogsSandy','MommyFayeee','HallfordJeannie','TannamiaHall','Jeffok16','Michael_KE7MT','Mickey19741','BIGRED476','Matt33822937','resa2330','rebelbrat71','OedekovenTerry','johnsons_nc','TXCoffeeSlinger','tambarry']
