## Analyzing Twitter Data Using Social Feed Manager

Social Feed Manager (SFM) collects individual posts (tweets, photos, blogs) from social
media sites. These posts are collected in their native raw data format, JSON, and
can be exported in many formats, including Excel spreadsheets. This document
introduces the steps and methods we used in our project to extract useful information
from SFM data. The goal is to compute, from the Twitter data, how often each target
Twitter user uses each hashtag and how often they retweet or quote each account. The
frequencies are stored in data frames sorted in descending order.

● A thorough SFM user guide is available here:

https://sfm.readthedocs.io/en/latest/userguide.html

## Define Libraries

In [None]:
# Import useful packages 
from collections import Counter
import pandas as pd

## Exploratory Data Analysis

In [None]:
# Read the Excel file exported from SFM and store it in the df data frame
df = pd.read_excel('Tweets for Digital collection project 20210526.xlsx')
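
The analysis below relies on three columns of the export: `user_screen_name`, `hashtags`, and `retweet_or_quote_screen_name`. A minimal sketch of a check that they are present, run here on a small made-up frame rather than the real spreadsheet (the `sample` frame and the `missing_columns` helper are hypothetical and not part of the original notebook):

```python
import pandas as pd

# Hypothetical stand-in for the exported spreadsheet; only the three
# columns used later in this notebook are included.
sample = pd.DataFrame({
    "user_screen_name": ["tsmullaney", "hild_de"],
    "hashtags": ["firstgen covid19", None],
    "retweet_or_quote_screen_name": [None, "LeidenDH"],
})

required_columns = ["user_screen_name", "hashtags", "retweet_or_quote_screen_name"]


def missing_columns(frame, required):
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]


# An empty list means every required column is available.
print(missing_columns(sample, required_columns))
print(missing_columns(sample.drop(columns=["hashtags"]), required_columns))
```

On the real data, the same call would be `missing_columns(df, required_columns)`.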

In [None]:
# Print out the number of rows and columns of the data frame
print(df.shape)

(16252, 37)


In [None]:
# Check for the total number of posts for each Twitter user
df['user_screen_name'].value_counts()

tsmullaney    3460
hild_de       3301
cfmeyskens    3299
AmandaUCSC    3251
ankeqiang     1504
vischina      1437
Name: user_screen_name, dtype: int64
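
Before counting hashtags, it can help to see how many of each user's posts carry a hashtag at all. A minimal sketch on a made-up frame (`toy` and its values are hypothetical; only the column names come from the notebook):

```python
import pandas as pd

# Hypothetical data: two users, some posts without hashtags.
toy = pd.DataFrame({
    "user_screen_name": ["a", "a", "b", "b", "b"],
    "hashtags": ["x y", None, "x", None, None],
})

# Total number of posts per user.
posts_per_user = toy["user_screen_name"].value_counts()

# count() skips missing values, so this is the number of posts with hashtags.
posts_with_hashtags = toy.groupby("user_screen_name")["hashtags"].count()

# The two Series are aligned on the user name when combined.
summary = pd.DataFrame({"posts": posts_per_user, "with_hashtags": posts_with_hashtags})
print(summary)
```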

## Analysis for Hashtags & Retweet Accounts

### Define the Two Input Variables: the Target Users and the Number of Observations per Twitter Account

In [None]:
# First, choose the target user(s) you want to analyze and set input_users accordingly
# Then set number to how many top results you want to keep for each target Twitter user
input_users = ['tsmullaney','hild_de']
number = 20
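
A screen name that does not occur in the data is silently ignored by the `isin` filter used in the cells below, so a typo in `input_users` produces no error. A minimal sketch of a check, assuming a made-up frame (`toy` and the name `not_in_data` are hypothetical):

```python
import pandas as pd

# Hypothetical data with the same column name as the real export.
toy = pd.DataFrame({"user_screen_name": ["tsmullaney", "hild_de", "hild_de"]})

# The last entry is deliberately absent from the data.
input_users = ["tsmullaney", "hild_de", "not_in_data"]

known_users = set(toy["user_screen_name"].unique())

# Split the requested names into those present in the data and those not.
valid_users = [user for user in input_users if user in known_users]
unknown_users = [user for user in input_users if user not in known_users]

print("Valid:", valid_users)
print("Not found:", unknown_users)
```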

### Counting Hashtag Frequency for the Users of Your Choice

In [None]:
# This cell computes how often the selected user(s) used each hashtag
# First, keep only the posts from the selected users
df_hashtag = df[df['user_screen_name'].isin(input_users)]

# Drop rows without hashtags and convert all hashtags to lower case
# (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
df_hashtag = df_hashtag[df_hashtag['hashtags'].notna()].copy()
df_hashtag['hashtags'] = df_hashtag['hashtags'].str.lower()

# Split posts with multiple hashtags so that each hashtag gets its own row
s = df_hashtag['hashtags'].str.split(' ').apply(pd.Series).stack()
s.index = s.index.droplevel(-1)
s.name = 'hashtags' # needs a name to join
del df_hashtag['hashtags']
final = df_hashtag.join(s)

# Group the data frame by user_screen_name and hashtags, and count the rows in each group
result = final.groupby(['user_screen_name', 'hashtags']).size().to_frame(name = 'Count').reset_index()

# Sort in descending order and keep only the top `number` rows per Twitter user
result = result.sort_values(['user_screen_name','Count'],ascending=False).groupby('user_screen_name').head(number).reset_index()
hashtag_result = result.drop('index', axis=1)

print("The Top", number, "Most Frequently Used Hashtags from", input_users, ": \n", hashtag_result)

The Top 20 Most Frequently Used Hashtags from ['tsmullaney', 'hild_de'] : 
    user_screen_name              hashtags  Count
0        tsmullaney          computerfire     31
1        tsmullaney       academictwitter     21
2        tsmullaney              firstgen     19
3        tsmullaney               covid19     10
4        tsmullaney          aftereffects      6
5        tsmullaney     firstgenprofessor      6
6        tsmullaney                flotus      6
7        tsmullaney                tiktok      6
8        tsmullaney               chinese      5
9        tsmullaney      firstgenacademic      4
10       tsmullaney            humblebrag      4
11       tsmullaney              stanford      4
12       tsmullaney             computing      3
13       tsmullaney       humanities4ever      3
14       tsmullaney  yourcomputerisonfire      3
15       tsmullaney              academia      2
16       tsmullaney      blacklivesmatter      2
17       tsmullaney          boycottmulan 
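
The split-and-stack step above can also be written with `DataFrame.explode`, which turns each element of a list-valued cell into its own row. This is an alternative sketch, not the notebook's own method, shown on made-up data (`toy` is hypothetical, and `number` is set to 1 here only to keep the result small):

```python
import pandas as pd

# Hypothetical data: hashtags are space-separated strings, as in the export.
toy = pd.DataFrame({
    "user_screen_name": ["a", "a", "a", "b", "b"],
    "hashtags": ["China History", "china", None, "DH china", "dh"],
})
number = 1

# Drop posts without hashtags, lower-case, split into lists, one row per hashtag.
tags = (
    toy.dropna(subset=["hashtags"])
       .assign(hashtags=lambda d: d["hashtags"].str.lower().str.split(" "))
       .explode("hashtags")
)

# Count each (user, hashtag) pair and keep the top `number` per user.
# Unlike the cell above, users are sorted in ascending order here.
counts = (
    tags.groupby(["user_screen_name", "hashtags"])
        .size()
        .reset_index(name="Count")
        .sort_values(["user_screen_name", "Count"], ascending=[True, False])
        .groupby("user_screen_name")
        .head(number)
        .reset_index(drop=True)
)
print(counts)
```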

### Output the result of the Hashtag dataframe as a CSV file

In [None]:
#hashtag_result.to_csv('Most_Hashtags.csv')
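
By default, `to_csv` also writes the row index, which shows up as an unnamed first column when the file is read back. A minimal sketch of writing without the index, using an in-memory buffer and a made-up frame so that no file is created (`toy_result` and its values are hypothetical):

```python
import io

import pandas as pd

# Hypothetical result frame with the same columns as hashtag_result.
toy_result = pd.DataFrame({
    "user_screen_name": ["a", "a"],
    "hashtags": ["china", "history"],
    "Count": [2, 1],
})

# index=False leaves out the 0, 1, 2, ... row labels.
buffer = io.StringIO()
toy_result.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same three columns.
round_trip = pd.read_csv(io.StringIO(csv_text))
```

For the real result, the equivalent call would be `hashtag_result.to_csv('Most_Hashtags.csv', index=False)`.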

### Counting the Accounts Most Often Retweeted or Quoted by the Users of Your Choice

In [None]:
# This cell follows the same steps as above to count how often the selected user(s) retweeted or quoted each account
# First, keep only the posts from the selected users (redefine input_users above if you want different users)
df_retweet = df[df['user_screen_name'].isin(input_users)]

# Drop all NAs 
df_retweet = df_retweet[df_retweet['retweet_or_quote_screen_name'].notna()]

# Group the data frame by user_screen_name and retweet_or_quote_screen_name, and count the rows in each group
result = df_retweet.groupby(['user_screen_name', 'retweet_or_quote_screen_name']).size().to_frame(name = 'Count').reset_index()

# Sort in descending order and keep only the top `number` rows per Twitter user
result = result.sort_values(['user_screen_name','Count'],ascending=False).groupby('user_screen_name').head(number).reset_index()
retweet_result = result.drop('index', axis=1)

print("The Top", number, "Most Frequently Used Retweet Accounts from", input_users, ": \n", retweet_result)

The Top 20 Most Frequently Used Retweet Accounts from ['tsmullaney', 'hild_de'] : 
    user_screen_name retweet_or_quote_screen_name  Count
0        tsmullaney                   tsmullaney    131
1        tsmullaney                     mitpress     16
2        tsmullaney                   palumboliu     12
3        tsmullaney                   histoftech      9
4        tsmullaney                      nytimes      9
5        tsmullaney                     jwassers      8
6        tsmullaney                      latimes      8
7        tsmullaney                     bjpeters      6
8        tsmullaney                 TexasTribune      5
9        tsmullaney               cloudquistador      5
10       tsmullaney              EastAsiaSciTech      4
11       tsmullaney                 RadicalAIPod      4
12       tsmullaney                 geekgalgroks      4
13       tsmullaney                     icpetrie      4
14       tsmullaney                kellyahammond      4
15       tsmullaney 
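
In the output above, the most frequent entry for tsmullaney is tsmullaney itself. If such self-retweets or self-quotes are not of interest, they could be filtered out before grouping. A minimal sketch on made-up data (`toy` is hypothetical; the comparison is case-sensitive, which is an assumption about how the names are stored):

```python
import pandas as pd

# Hypothetical data: user "a" quotes itself once and account "x" once.
toy = pd.DataFrame({
    "user_screen_name": ["a", "a", "a", "b"],
    "retweet_or_quote_screen_name": ["a", "x", None, "a"],
})

# Keep rows that have a source account different from the posting user.
has_source = toy["retweet_or_quote_screen_name"].notna()
is_self = toy["user_screen_name"] == toy["retweet_or_quote_screen_name"]
others_only = toy[has_source & ~is_self]
print(others_only)
```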

### Output the result of the retweet dataframe as a CSV file

In [None]:
#retweet_result.to_csv('Most_Retweet_Accounts.csv')

## Second Approach: the Older Method, Which Is Slower than the First Method

### Filtering the Frequency on User: tsmullaney



In [None]:
# Select the posts of a single user
df_tsmullaney = df[df['user_screen_name']=='tsmullaney']

# Collect the lower-cased hashtag strings and the retweeted or quoted account names, skipping missing values
lst_tsmullaney = df_tsmullaney['hashtags'].dropna().str.lower().to_list()
retweet_screen_name_tsmullaney = df_tsmullaney['retweet_or_quote_screen_name'].dropna().to_list()

# Split each space-separated hashtag string into individual hashtags
lst_tsmullaney_final = []
for tags in lst_tsmullaney:
    lst_tsmullaney_final += tags.split(' ')

# Count the hashtags and keep the 30 most common
lst_tsmullaney_final = pd.DataFrame(Counter(lst_tsmullaney_final).most_common(30),
                    columns=['Word', 'Frequency']).set_index('Word')

# Count the retweeted or quoted accounts and keep the 30 most common
retweet_screen_name_list_tsmullaney = pd.DataFrame(Counter(retweet_screen_name_tsmullaney).most_common(30),
                    columns=['Retweet_Screen_Name', 'Frequency']).set_index('Retweet_Screen_Name')

print("The Top 30 Most Frequently Used Hashtags: \n", lst_tsmullaney_final)
print("\n")
print("The Top 30 Most Frequently Users to Retweet from: \n", retweet_screen_name_list_tsmullaney)

The Top 30 Most Frequently Used Hashtags: 
                       Frequency
Word                           
computerfire                 31
academictwitter              21
firstgen                     19
covid19                      10
tiktok                        6
firstgenprofessor             6
flotus                        6
aftereffects                  6
chinese                       5
stanford                      4
firstgenacademic              4
humblebrag                    4
computing                     3
yourcomputerisonfire          3
humanities4ever               3
chinesecomputer               2
cdcforum                      2
comics                        2
gradschool                    2
writing                       2
firstgenprof                  2
phd                           2
academia                      2
scottatlas                    2
boycottmulan                  2
blacklivesmatter              2
guggenheim                    2
h1b                         
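
The hashtag-counting steps in the cell above could be wrapped in a function that takes the screen name as a parameter, so the same code serves any user. A minimal sketch, assuming the same column names as the export (`top_hashtags` and the `toy` frame are hypothetical and not part of the original notebook):

```python
from collections import Counter

import pandas as pd


def top_hashtags(frame, user, n=30):
    """Return the n most common lower-cased hashtags of one user."""
    cells = frame.loc[frame["user_screen_name"] == user, "hashtags"].dropna()
    counter = Counter()
    for cell in cells.str.lower():
        # Each cell holds space-separated hashtags.
        counter.update(cell.split(" "))
    return pd.DataFrame(counter.most_common(n),
                        columns=["Word", "Frequency"]).set_index("Word")


# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "user_screen_name": ["a", "a", "b"],
    "hashtags": ["China dh", "china", "x"],
})
top_a = top_hashtags(toy, "a", n=1)
print(top_a)
```

On the real data, a call such as `top_hashtags(df, 'tsmullaney')` would play the role of the hashtag part of the cell above.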

### Filtering the Frequency on User: hild_de

In [None]:
df_hild_de = df[df['user_screen_name']=='hild_de']

lst_hild_de = df_hild_de['hashtags'].dropna().str.lower().to_list()
retweet_screen_name_hild_de= df_hild_de['retweet_or_quote_screen_name'].dropna().to_list()

retweet_screen_name_list_hild_de = pd.DataFrame(Counter(retweet_screen_name_hild_de).most_common(30),
                    columns=['Retweet_Screen_Name', 'Frequency']).set_index('Retweet_Screen_Name')

lst_hild_de_final = []

# Split each space-separated hashtag string into individual hashtags
for tags in lst_hild_de:
    lst_hild_de_final += tags.split(' ')

lst_hild_de_final = pd.DataFrame(Counter(lst_hild_de_final).most_common(30),
                    columns=['Word', 'Frequency']).set_index('Word')
                    
print("The Top 30 Most Frequently Used Hashtags: \n", lst_hild_de_final)
print("\n")
print("The Top 30 Most Frequently Users to Retweet from: \n", retweet_screen_name_list_hild_de)

The Top 30 Most Frequently Used Hashtags: 
                    Frequency
Word                        
medieval                 174
dh                       102
highered                 102
chinese                   64
china                     63
earlymodern               56
history                   56
markus                    51
digitalhistory            37
xinjiang                  37
digitalhumanities         36
woinactie                 35
openaccess                31
hongkong                  28
leiden                    26
coronavirus               26
dhasia                    23
maps                      22
hk                        22
asia                      22
cartography               20
taiwan                    20
covid19                   19
art                       18
encyclopedias             17
education                 16
aas2019                   16
libraries                 15
publishing                15
humanities                15


The Top 30 Most Frequently

### Filtering the Frequency on User: cfmeyskens

In [None]:
df_cfmeyskens = df[df['user_screen_name']=='cfmeyskens']

lst_cfmeyskens = df_cfmeyskens['hashtags'].dropna().str.lower().to_list()
retweet_screen_name_cfmeyskens= df_cfmeyskens['retweet_or_quote_screen_name'].dropna().to_list()

retweet_screen_name_list_cfmeyskens = pd.DataFrame(Counter(retweet_screen_name_cfmeyskens).most_common(30),
                    columns=['Retweet_Screen_Name', 'Frequency']).set_index('Retweet_Screen_Name')

lst_cfmeyskens_final = []

# Split each space-separated hashtag string into individual hashtags
for tags in lst_cfmeyskens:
    lst_cfmeyskens_final += tags.split(' ')

lst_cfmeyskens_final = pd.DataFrame(Counter(lst_cfmeyskens_final).most_common(30),
                    columns=['Word', 'Frequency']).set_index('Word')

print("The Top 30 Most Frequently Used Hashtags: \n", lst_cfmeyskens_final)
print("\n")
print("The Top 30 Most Frequently Users to Retweet from: \n", retweet_screen_name_list_cfmeyskens)

The Top 30 Most Frequently Used Hashtags: 
                     Frequency
Word                         
china                      17
coronavirus                 8
covid19                     6
taiwan                      4
aas2020                     4
gbleearchive                3
envhist                     2
twitterstorians             2
politics                    2
coldwar                     2
xinjiang                    2
landwars                    2
hongkong                    2
gbleearchive40              2
covid_19                    2
asianow                     2
s                           2
coronaravirus               2
hunan                       1
韶山                          1
shaoshan                    1
dccs                        1
olympics                    1
oscar                       1
chinalocalglobal            1
pornography                 1
jas                         1
蘋果日報                        1
chinareformmonitor          1
chinaafrica               

### Filtering the Frequency on User: AmandaUCSC

In [None]:
df_AmandaUCSC = df[df['user_screen_name']=='AmandaUCSC']

lst_AmandaUCSC = df_AmandaUCSC['hashtags'].dropna().str.lower().to_list()
retweet_screen_name_AmandaUCSC= df_AmandaUCSC['retweet_or_quote_screen_name'].dropna().to_list()

retweet_screen_name_list_AmandaUCSC = pd.DataFrame(Counter(retweet_screen_name_AmandaUCSC).most_common(30),
                    columns=['Retweet_Screen_Name', 'Frequency']).set_index('Retweet_Screen_Name')

lst_AmandaUCSC_final = []

# Split each space-separated hashtag string into individual hashtags
for tags in lst_AmandaUCSC:
    lst_AmandaUCSC_final += tags.split(' ')

lst_AmandaUCSC_final = pd.DataFrame(Counter(lst_AmandaUCSC_final).most_common(30),
                    columns=['Word', 'Frequency']).set_index('Word')

print("The Top 30 Most Frequently Used Hashtags: \n", lst_AmandaUCSC_final)
print("\n")
print("The Top 30 Most Frequently Users to Retweet from: \n", retweet_screen_name_list_AmandaUCSC)

The Top 30 Most Frequently Used Hashtags: 
                      Frequency
Word                          
digitalasiaconf             71
aas2017                     55
china                       30
aas2016                     20
datech2017                  18
dh                          16
maoistlegacy                13
breaking                    13
digitalhumanities           12
chinese                     11
dhasia                      10
greennewdeal                 7
aha18                        7
update                       7
twitterstorians              6
sporthistory                 6
goldengamesnus               6
coldwarhist                  6
paris                        6
coldwarsport                 5
sportsdiplomacy              5
dh2018                       5
history                      5
ilooklikeahistorian          5
otd                          5
sport                        4
taiwan                       4
omeka10years                 4
markus                    

### Filtering the Frequency on User: ankeqiang

In [None]:
df_ankeqiang = df[df['user_screen_name']=='ankeqiang']

lst_ankeqiang = df_ankeqiang['hashtags'].dropna().str.lower().to_list()
retweet_screen_name_ankeqiang = df_ankeqiang['retweet_or_quote_screen_name'].dropna().to_list()

retweet_screen_name_list_ankeqiang = pd.DataFrame(Counter(retweet_screen_name_ankeqiang).most_common(30),
                    columns=['Retweet_Screen_Name', 'Frequency']).set_index('Retweet_Screen_Name')

lst_ankeqiang_final = []

# Split each space-separated hashtag string into individual hashtags
for tags in lst_ankeqiang:
    lst_ankeqiang_final += tags.split(' ')

lst_ankeqiang_final = pd.DataFrame(Counter(lst_ankeqiang_final).most_common(30),
                    columns=['Word', 'Frequency']).set_index('Word')

print("The Top 30 Most Frequently Used Hashtags: \n", lst_ankeqiang_final)
print("\n")
print("The Top 30 Most Frequently Users to Retweet from: \n", retweet_screen_name_list_ankeqiang)

The Top 30 Most Frequently Used Hashtags: 
                    Frequency
Word                        
histchine6               567
ihn                      197
histtaiwan6               75
histchine                 56
histnum                   42
ccasioc                   42
sinq15                    36
eoaease                   24
histchine5                22
semmethodo                22
histtaiwan                18
enpchina                  15
histchina6                14
feedly                    14
digitalhumanities         11
sine14                    10
hspatial                   8
hustchine6                 3
china                      3
semmetho                   3
蘋果日報                       2
hkpolice                   2
demainamu                  2
ihnamu                     2
enp                        2
infographic                2
forbiddencity              2
palacemuseum               2
mustread                   2
virtshai                   2


The Top 30 Most Frequently

### Filtering the Frequency on User: vischina

In [None]:
df_vischina = df[df['user_screen_name']=='vischina']

lst_vischina = df_vischina['hashtags'].dropna().str.lower().to_list()
retweet_screen_name_vischina = df_vischina['retweet_or_quote_screen_name'].dropna().to_list()

retweet_screen_name_list_vischina = pd.DataFrame(Counter(retweet_screen_name_vischina).most_common(30),
                    columns=['Retweet_Screen_Name', 'Frequency']).set_index('Retweet_Screen_Name')

lst_vischina_final = []

# Split each space-separated hashtag string into individual hashtags
for tags in lst_vischina:
    lst_vischina_final += tags.split(' ')

lst_vischina_final = pd.DataFrame(Counter(lst_vischina_final).most_common(30),
                    columns=['Word', 'Frequency']).set_index('Word')

print("The Top 30 Most Frequently Used Hashtags: \n", lst_vischina_final)
print("\n")
print("The Top 30 Most Frequently Users to Retweet from: \n", retweet_screen_name_list_vischina)

The Top 30 Most Frequently Used Hashtags: 
                               Frequency
Word                                   
china                               219
chinesehistory                      217
history                             215
historicalphotos                    165
historicalphotographsofchina        156
shanghai                             64
beijing                              59
hongkong                             50
explorechina                         40
chineseculture                       37
tianjin                              35
travelchina                          34
tientsin                             31
oldchina                             30
photography                          30
historyphotographed                  29
shanghaihistory                      28
hongkonghistory                      24
historyinphotos                      24
photooftheday                        22
guangzhou                            20
beijinglife                         

### Combine All Users' Retweet Account Information into One Table


In [None]:
# Combine the retweet tables of the first two Twitter users side by side
# The two indexes passed to `on` appear in the output as the key_0 and key_1 columns
result = pd.merge(retweet_screen_name_list_tsmullaney, retweet_screen_name_list_hild_de, on=[retweet_screen_name_list_tsmullaney.index, retweet_screen_name_list_hild_de.index])

# Move Frequency_x so that it sits directly after key_0
col = result.pop("Frequency_x")
result.insert(1, "Frequency_x", col)
result

|   | key_0 | Frequency_x | key_1 | Frequency_y |
|---|---|---|---|---|
| 0 | tsmullaney | 131 | AASAsianStudies | 58 |
| 1 | mitpress | 16 | hild_de | 41 |
| 2 | palumboliu | 12 | LeidenDH | 35 |
| 3 | nytimes | 9 | rensbod | 26 |
| 4 | histoftech | 9 | timeshighered | 22 |
| 5 | jwassers | 8 | jwassers | 22 |
| 6 | latimes | 8 | DID_ACTE | 20 |
| 7 | bjpeters | 6 | SCMPNews | 19 |
| 8 | cloudquistador | 5 | FairbankCenter | 17 |
| 9 | TexasTribune | 5 | pvierth | 17 |
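
Judging from the output above, the merge pairs the two tables row by row according to rank. Under that assumption, the same layout could be built more directly by resetting each index and concatenating by position. This is an alternative sketch on made-up data, not the notebook's own method (`left`, `right`, and their values are hypothetical):

```python
import pandas as pd

# Hypothetical frequency tables, indexed by account name like the real ones.
left = pd.DataFrame({"Frequency": [5, 3]},
                    index=pd.Index(["x", "y"], name="Retweet_Screen_Name"))
right = pd.DataFrame({"Frequency": [9, 1]},
                     index=pd.Index(["p", "q"], name="Retweet_Screen_Name"))

# reset_index() turns the account names into a column and numbers the rows
# 0, 1, ...; concat with axis=1 then aligns the two frames on those numbers.
pair = pd.concat([left.reset_index(), right.reset_index()], axis=1)

# Reuse the column names shown in the merge output.
pair.columns = ["key_0", "Frequency_x", "key_1", "Frequency_y"]
print(pair)
```

If the two tables had different lengths, `concat` would fill the shorter one with missing values.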


In [None]:
# Combine the Third and Fourth Twitter users into one
result_one = pd.merge(retweet_screen_name_list_cfmeyskens, retweet_screen_name_list_AmandaUCSC, on=[retweet_screen_name_list_cfmeyskens.index, retweet_screen_name_list_AmandaUCSC.index])

col = result_one.pop("Frequency_x")
result_one.insert(1, "Frequency_x", col)
result_one

|   | key_0 | Frequency_x | key_1 | Frequency_y |
|---|---|---|---|---|
| 0 | Dali_Yang | 52 | maoistlegacy | 113 |
| 1 | EvanFeigenbaum | 34 | relevantorgans | 64 |
| 2 | fravel | 33 | manwhohasitall | 51 |
| 3 | jwassers | 25 | AcademicsSay | 46 |
| 4 | vshih2 | 25 | LetaHong | 44 |
| 5 | guo_xuguang | 24 | hild_de | 40 |
| 6 | JeremyJierong | 15 | halfthesky49 | 36 |
| 7 | ehundman | 15 | jwassers | 28 |
| 8 | philip_thai | 15 | sunrisemvmt | 27 |
| 9 | adam_tooze | 15 | AFP | 27 |


In [None]:
# Combine the Fifth and Sixth Twitter users into one
result_two = pd.merge(retweet_screen_name_list_ankeqiang, retweet_screen_name_list_vischina , on=[retweet_screen_name_list_ankeqiang.index, retweet_screen_name_list_vischina.index])

col = result_two.pop("Frequency_x")
result_two.insert(1, "Frequency_x", col)
result_two

|   | key_0 | Frequency_x | key_1 | Frequency_y |
|---|---|---|---|---|
| 0 | demosisto | 3 | bickers | 52 |
| 1 | Calligramien | 3 | tongbingxue | 29 |
| 2 | KongTsungGan | 2 | GardensOfChina | 26 |
| 3 | IAO_Lyon | 2 | chinafamilies | 17 |
| 4 | CentreChine | 2 | hongkonghistory | 13 |
| 5 | JELEO2T | 2 | chinarhyming | 13 |
| 6 | CulturalTaiwan | 2 | CraigClunas | 11 |
| 7 | ZhangZhulin | 2 | UoBrisHistory | 10 |
| 8 | Pr_Logos | 1 | ukpcn | 9 |
| 9 | ecritac | 1 | ClassicChina | 9 |


In [None]:
dfs = [result, result_one, result_two]
result_1 = pd.concat(dfs, join='outer', axis=1)
result_1 = result_1.set_axis(['tsmullaney', 'Frequency', 'hild_de', 'Frequency', 'cfmeyskens', 'Frequency', 'AmandaUCSC', 'Frequency', 'ankeqiang', 'Frequency', 'vischina', 'Frequency'], axis='columns')
result_1

|   | tsmullaney | Frequency | hild_de | Frequency | cfmeyskens | Frequency | AmandaUCSC | Frequency | ankeqiang | Frequency | vischina | Frequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tsmullaney | 131 | AASAsianStudies | 58 | Dali_Yang | 52 | maoistlegacy | 113 | demosisto | 3 | bickers | 52 |
| 1 | mitpress | 16 | hild_de | 41 | EvanFeigenbaum | 34 | relevantorgans | 64 | Calligramien | 3 | tongbingxue | 29 |
| 2 | palumboliu | 12 | LeidenDH | 35 | fravel | 33 | manwhohasitall | 51 | KongTsungGan | 2 | GardensOfChina | 26 |
| 3 | nytimes | 9 | rensbod | 26 | jwassers | 25 | AcademicsSay | 46 | IAO_Lyon | 2 | chinafamilies | 17 |
| 4 | histoftech | 9 | timeshighered | 22 | vshih2 | 25 | LetaHong | 44 | CentreChine | 2 | hongkonghistory | 13 |
| 5 | jwassers | 8 | jwassers | 22 | guo_xuguang | 24 | hild_de | 40 | JELEO2T | 2 | chinarhyming | 13 |
| 6 | latimes | 8 | DID_ACTE | 20 | JeremyJierong | 15 | halfthesky49 | 36 | CulturalTaiwan | 2 | CraigClunas | 11 |
| 7 | bjpeters | 6 | SCMPNews | 19 | ehundman | 15 | jwassers | 28 | ZhangZhulin | 2 | UoBrisHistory | 10 |
| 8 | cloudquistador | 5 | FairbankCenter | 17 | philip_thai | 15 | sunrisemvmt | 27 | Pr_Logos | 1 | ukpcn | 9 |
| 9 | TexasTribune | 5 | pvierth | 17 | adam_tooze | 15 | AFP | 27 | ecritac | 1 | ClassicChina | 9 |
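
Because `set_axis` above assigns the label `Frequency` six times, selecting `result_1['Frequency']` returns several columns at once. One way to keep every column label unique is to pass `keys` to `pd.concat`, which puts the user name on a second header level. A minimal sketch on made-up data (`left`, `right`, and the names `user_one` and `user_two` are hypothetical):

```python
import pandas as pd

# Hypothetical per-user frequency tables, indexed by account name.
left = pd.DataFrame({"Frequency": [5, 3]},
                    index=pd.Index(["x", "y"], name="Account"))
right = pd.DataFrame({"Frequency": [9, 1]},
                     index=pd.Index(["p", "q"], name="Account"))

# keys= adds the user name as the top level of a two-level column header.
combined = pd.concat(
    [left.reset_index(), right.reset_index()],
    axis=1,
    keys=["user_one", "user_two"],
)

# Flatten the two levels into single unique labels, e.g. for a CSV export.
flat = combined.copy()
flat.columns = [f"{user}_{field}" for user, field in combined.columns]
print(flat)
```

With the two-level header, `combined["user_two"]` selects just that user's pair of columns.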


In [None]:
#result_1.to_csv('Most_Retweet_Accounts.csv')