# <h1 style="text-align: center; text-decoration: underline">Creating & Identifying Variables</h1>
<br>
<br>
><p style='font-size: large'>We'll take a look at the hashtags represented in the HT-full dataset. We'll split original tweets and retweets into two separate dataframes, so that we can compare them side by side in the paper</p>


In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame
from pandas import Series 

In [2]:
pd.set_option('display.max_columns', None)

In [3]:
df = pd.read_pickle('./data/a_HT_no_retweets.pkl')
df.head(1)

Unnamed: 0,id,hash_select,tweet_id,inserted_date,truncated,language,possibly_sensitive,coordinates,retweeted_status,created_at_text,created_at,content,from_user_screen_name,from_user_id,from_user_followers_count,from_user_friends_count,from_user_listed_count,from_user_statuses_count,from_user_description,from_user_location,from_user_created_at,retweet_count,entities_urls,entities_urls_count,entities_hashtags,entities_hashtags_count,entities_mentions,entities_mentions_count,in_reply_to_screen_name,in_reply_to_status_id,source,entities_expanded_urls,json_output,entities_media_count,media_expanded_url,media_url,media_type,video_link,photo_link,twitpic
9,2555,%2523montanamoment,8.55e+17,35:57.3,0,en,0.0,,,Fri Apr 21 04:21:17 +0000 2017,21:17.0,The night sky is a fascinating place especiall...,LeonKauffman,425125748.0,673,152,39,1616,"Hydrologist, photographer, fan of Drexel baske...","Condon, Montana, USA",Wed Nov 30 15:58:25 +0000 2011,4,,0,MontanaMoment,1,,0,,,"<a href=""https://about.twitter.com/products/tw...",,"{u'contributors': None, u'truncated': False, u...",1.0,https://twitter.com/LeonKauffman/status/855275...,http://pbs.twimg.com/media/C96LaxcUQAEMOQl.jpg,photo,0,0,0


In [4]:
print len(df), "original tweets"
print len(df.columns), "variables"

2187 original tweets
40 variables


<h2> Export # of Followers Averaged by account before standardizing variables below</h2><br>
><p>  We'll create a new df containing only the variables we're interested in, group by user, and average each account's followers. Because each user object is a snapshot taken when a tweet was inserted into the dataset, the follower count attached to an account's tweets changes slightly over the collection period. Averaging lets us refer to each account by a single `followers_count` value (which is really its average number of followers over the collection period).<br><br> We'll also export a .csv file, so that we keep a non-standardized record for comparing accounts by number of followers.</p>
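As a minimal sketch of the averaging idea described above, the toy example below (written for Python 3 and a recent pandas, with made-up account names and follower counts) collapses several tweet-level rows into one mean follower count per account.

```python
import pandas as pd

# Toy data (hypothetical values): one row per tweet, follower counts drift over time.
toy = pd.DataFrame({
    'from_user_screen_name': ['acct_a', 'acct_a', 'acct_b'],
    'from_user_followers_count': [100, 110, 50],
})

# One row per account, holding the mean follower count across that account's tweets,
# sorted so the accounts with the most followers come first.
avg_followers = (
    toy.groupby('from_user_screen_name')['from_user_followers_count']
       .mean()
       .sort_values(ascending=False)
)
```

Selecting the single column before calling `mean()` returns a Series rather than a one-column dataframe, which is often handier for lookups by account name.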

In [5]:
# This could also be done in a more 'Pythonic' way using the apply method
df_followers = df[['from_user_screen_name', 'from_user_followers_count']].groupby('from_user_screen_name')
df_followers = df_followers.mean()
df_followers = df_followers.sort_values('from_user_followers_count', ascending=False)
df_followers.head(10)

from_user_screen_name,from_user_followers_count
earthXplorer,183102.8125
GlacierNPS,181839.833333
MalloryOnTravel,101157.0
StephanieQuayle,81327.0
LuxuryTravel77,71775.5
visitmontana,57882.324074
ManTripping,57165.5
robertserian,54845.5
ECAatState,54770.0
StormHour,47673.0


In [6]:
# check that values make sense
df_compare = df[['from_user_screen_name', 'from_user_followers_count']].sort_values('from_user_followers_count', ascending=False)
df_compare.head(20)

Unnamed: 0,from_user_screen_name,from_user_followers_count
2492,GlacierNPS,191658
2945,GlacierNPS,187765
3072,GlacierNPS,187765
2966,GlacierNPS,187765
6138,earthXplorer,183605
5793,earthXplorer,183604
5748,earthXplorer,183604
5747,earthXplorer,183604
5675,earthXplorer,183604
5677,earthXplorer,183604


In [7]:
# Looks right, time to export
#df_followers.to_csv('Average followers by user-account - nonstandardized.csv', sep=',')

<h2>Create a new variable based on `from_user_screen_name`</h2>

<blockquote style="font: 14px">
We are interested in measuring the effect of account type upon user engagement, which we will measure using retweets. Thus, in our current tweet-level dataset, consisting only of original tweets, we will generate a binary account-status variable (named `mtot` below) and assign it a one if a tweet is sent from 'visitmontana', the official Montana destination marketing organization. For all other accounts, the variable is assigned a zero.
<br>
First, let's explore the from_user_screen_name a bit further to see how it has changed from the original dataset, which included all tweets (including retweets) that used the hashtag 'montanaMoment.'
</blockquote>

In [8]:
# Here is a list of the 546 users who were an original tweeter of 'montanaMoment' at some point between 12/24/16 and 04/21/17

print len(pd.unique(df.from_user_screen_name.ravel()))
pd.unique(df.from_user_screen_name.ravel())

546


array(['LeonKauffman', 'ErinWx', 'wxmissoula', 'lastbestbox',
       'ShannonAMay', 'UNLV_Sage', 'RMKK', 'CWPsPhotos', 'visitmontana',
       'BlackBullGolf', 'shawnnewton', 'EmilieRSaunders', 'KLeaguePhoto',
       'MTMVictoria', 'michelleroy', 'DancingAspens', 'WildReflections',
       'yellowstonegriz', 'mmcphoto', 'MTHist', 'BozemanBrewing',
       'montanalori', 'Melaniephurst', 'billingsgazette', 'Amy_Savannah',
       'myrnam71', 'MichaelBHodges', 'avaldez', 'AllPointsMT',
       'pinnaclemontana', 'abigaildennis', 'MerevinTweets', '1Blonde_Amy',
       'OrphanGirl_MT', 'timberveil', 'HazerLive', 'Suzie_OConnell',
       'AlmaDCastillo', 'carlygarrison87', 'BrikoUSA', 'edoornek',
       'KINSEYHD', 'CarrollCollege', 'stefferology', 'TravelDazeco',
       'nprofilm', 'ellesbee', 'adriansgphoto', 'CC_WxWitch',
       'codyedwards', 'MGCTwest', 'RockyTopSkiBum', 'Visitbigsky',
       'TariBKfan', 'MarketHook', 'Sabre_Moore', 'TailyrIrvine',
       'DrLimnology', 'Spark_Creative', '

<h3>Of this list of 546 users, 'TheExceptionMag' is the most frequent user of the hashtag, with 318 original tweets in the period.</h3>


<blockquote>   
As discussed in the paper, this account is suspected to be a bot and will be excluded from later regression analyses.
</blockquote>

In [9]:
df['from_user_screen_name'].describe()

count                2187
unique                546
top       TheExceptionMag
freq                  318
Name: from_user_screen_name, dtype: object

In [10]:
df.head(1)

Unnamed: 0,id,hash_select,tweet_id,inserted_date,truncated,language,possibly_sensitive,coordinates,retweeted_status,created_at_text,created_at,content,from_user_screen_name,from_user_id,from_user_followers_count,from_user_friends_count,from_user_listed_count,from_user_statuses_count,from_user_description,from_user_location,from_user_created_at,retweet_count,entities_urls,entities_urls_count,entities_hashtags,entities_hashtags_count,entities_mentions,entities_mentions_count,in_reply_to_screen_name,in_reply_to_status_id,source,entities_expanded_urls,json_output,entities_media_count,media_expanded_url,media_url,media_type,video_link,photo_link,twitpic
9,2555,%2523montanamoment,8.55e+17,35:57.3,0,en,0.0,,,Fri Apr 21 04:21:17 +0000 2017,21:17.0,The night sky is a fascinating place especiall...,LeonKauffman,425125748.0,673,152,39,1616,"Hydrologist, photographer, fan of Drexel baske...","Condon, Montana, USA",Wed Nov 30 15:58:25 +0000 2011,4,,0,MontanaMoment,1,,0,,,"<a href=""https://about.twitter.com/products/tw...",,"{u'contributors': None, u'truncated': False, u...",1.0,https://twitter.com/LeonKauffman/status/855275...,http://pbs.twimg.com/media/C96LaxcUQAEMOQl.jpg,photo,0,0,0


<h2>`mtot` variable</h2><br>

<blockquote>
   Because we know that the account 'visitmontana' is a unique identifier of the state's official destination marketing Twitter
   account, we can use it to identify the tweets sent by MTOT that include the 'montanaMoment' hashtag.
   <br>
   <br>
   First, in step 1, we use the pandas `str.match` method to test each value in the from_user_screen_name column, assigning True where the name starts with 'visitmontana' and False in all other cases (missing values are treated as False via `na=False`).
   <br>
   <br>
   Then, in step 2, we use the pandas `astype()` method to convert the booleans into binary 0/1 integers.
</blockquote>
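One detail worth keeping in mind: `str.match` anchors the pattern at the start of the string only, so it behaves like a prefix test. The sketch below (Python 3, recent pandas) contrasts it with an exact equality test; the account name 'visitmontana_fan' is hypothetical and used only to show the difference.

```python
import pandas as pd

# 'visitmontana_fan' is a made-up account name used to illustrate prefix matching.
names = pd.Series(['visitmontana', 'visitmontana_fan', 'LeonKauffman'])

# str.match anchors the pattern at the start only, so a longer name
# sharing the same prefix is flagged as well.
prefix_match = names.str.match('visitmontana', na=False)

# Exact equality flags only the account itself.
exact_match = names == 'visitmontana'

mtot_prefix = prefix_match.astype(int).tolist()
mtot_exact = exact_match.astype(int).tolist()
```

In this dataset the two approaches should agree as long as no other screen name begins with 'visitmontana'; comparing the flagged count against a direct equality count is a quick way to confirm that.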

In [11]:
    # step 1
df['mtot'] = df['from_user_screen_name'].str.match('visitmontana', na=False)
df[['from_user_screen_name', 'mtot']].head()

Unnamed: 0,from_user_screen_name,mtot
9,LeonKauffman,False
10,ErinWx,False
11,wxmissoula,False
22,lastbestbox,False
24,ShannonAMay,False


In [12]:
    # step 2
df['mtot'] = df['mtot'].astype(int)
df[['from_user_screen_name', 'mtot', 'entities_hashtags']].head()

Unnamed: 0,from_user_screen_name,mtot,entities_hashtags
9,LeonKauffman,0,MontanaMoment
10,ErinWx,0,"sunset, Bitterroot, mtwx, Montana, MontanaMoment"
11,wxmissoula,0,"sunset, Bitterroot, mtwx, Montana, MontanaMoment"
22,lastbestbox,0,"montanamoment, lastbestbox, montana, montanahi..."
24,ShannonAMay,0,MontanaMoment


In [13]:
print "DATA CHECK:"
print len(df.columns), "variables"
print len(df), "tweets"
print "All good:", len(df.columns)==41 and len(df)==2187

DATA CHECK:
41 variables
2187 tweets
All good: True


In [14]:
print len(df[df['mtot'] == 1])
print len(df[df['mtot'] == 0])

print '\n', df.mtot.describe()

108
2079

count    2187.000000
mean        0.049383
std         0.216717
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: mtot, dtype: float64


---
<br>
<h3> Create a dummy variable for @TheExceptionMag</h3>
<br>
Because TheExceptionMag has so many retweets, categorizing it with the `other_org` variable (organizations and businesses that are not individuals and are not associated with the state) would bias the effect size for that category, i.e. an '`other_org` effect' would become a '`TheExceptionMag` effect'.
<br>
<br>

In [15]:
# same process as the mtot variable
    # step 1
df['exception_mag'] = df['from_user_screen_name'].str.match('TheExceptionMag', na=False)
df[['from_user_screen_name', 'exception_mag']].head()

Unnamed: 0,from_user_screen_name,exception_mag
9,LeonKauffman,False
10,ErinWx,False
11,wxmissoula,False
22,lastbestbox,False
24,ShannonAMay,False


In [16]:
    # step 2
df['exception_mag'] = df['exception_mag'].astype(int)
df[['from_user_screen_name', 'exception_mag', 'mtot']].head()

Unnamed: 0,from_user_screen_name,exception_mag,mtot
9,LeonKauffman,0,0
10,ErinWx,0,0
11,wxmissoula,0,0
22,lastbestbox,0,0
24,ShannonAMay,0,0


In [17]:
print "DATA CHECK:"
print len(df.columns), "variables"
print len(df), "tweets"
print "All good:", len(df.columns)==42 and len(df)==2187

DATA CHECK:
42 variables
2187 tweets
All good: True


In [18]:
print len(df[df['exception_mag'] == 1])
print len(df[df['exception_mag'] == 0])
print '\n', df.exception_mag.describe()

318
1869

count    2187.000000
mean        0.145405
std         0.352592
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: exception_mag, dtype: float64


---
<br>
<h3> Create a dummy variable for LeonKauffman</h3>
<br>
Because LeonKauffman has so many retweets, categorizing it with the `bus_int_ind` variable (private individuals with a business interest) would bias the effect size for this category, i.e. a '`bus_int_ind` effect' would become the '`LeonKauffman` effect'.
<br>
<br>
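Since the same two-step pattern (flag the account, then cast to integer) is repeated for each account of interest, it could be wrapped in a small helper. The sketch below is one way to do that in Python 3 with a recent pandas; the function name `add_account_dummy` is hypothetical and not part of the notebook, and it uses exact equality rather than `str.match`.

```python
import pandas as pd

def add_account_dummy(frame, screen_name, column):
    """Return a copy of frame with a 0/1 column flagging tweets sent by screen_name."""
    out = frame.copy()
    out[column] = (out['from_user_screen_name'] == screen_name).astype(int)
    return out

# Toy tweet-level data: one row per tweet.
toy = pd.DataFrame({'from_user_screen_name': ['LeonKauffman', 'ErinWx', 'TheExceptionMag']})

# Build one dummy per account of interest.
accounts = [('visitmontana', 'mtot'),
            ('TheExceptionMag', 'exception_mag'),
            ('LeonKauffman', 'kauffman')]
for name, col in accounts:
    toy = add_account_dummy(toy, name, col)
```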

In [19]:
# same process as the mtot and exception_mag variables
    # step 1
df['kauffman'] = df['from_user_screen_name'].str.match('LeonKauffman', na=False)
df[['from_user_screen_name', 'kauffman']].head()

Unnamed: 0,from_user_screen_name,kauffman
9,LeonKauffman,True
10,ErinWx,False
11,wxmissoula,False
22,lastbestbox,False
24,ShannonAMay,False


In [20]:
    # step 2
df['kauffman'] = df['kauffman'].astype(int)
df[['from_user_screen_name', 'exception_mag', 'mtot', 'kauffman']].head()

Unnamed: 0,from_user_screen_name,exception_mag,mtot,kauffman
9,LeonKauffman,0,0,1
10,ErinWx,0,0,0
11,wxmissoula,0,0,0
22,lastbestbox,0,0,0
24,ShannonAMay,0,0,0


In [21]:
print "DATA CHECK:"
print len(df.columns), "variables"
print len(df), "tweets"
print "All good:", len(df.columns)==43 and len(df)==2187

DATA CHECK:
43 variables
2187 tweets
All good: True


In [22]:
print len(df[df['kauffman'] == 1])
print len(df[df['kauffman'] == 0])
print '\n', df.kauffman.describe()

101
2086

count    2187.000000
mean        0.046182
std         0.209929
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: kauffman, dtype: float64


---
<br>
<h2>Use <i>retweet_count</i> variable to generate <i>retweet_dummy</i> binary variable</h2>


In [23]:
# retweet_dummy = 0 if retweet_count == 0, and 1 otherwise
# keep in mind, these are only for the original tweets (i.e., the 'causal' base of any later retweets/favorites)

df['retweet_dummy'] = np.where(df['retweet_count']==0, 0, 1)
df[['from_user_screen_name', 'retweet_dummy']].head()

Unnamed: 0,from_user_screen_name,retweet_dummy
9,LeonKauffman,1
10,ErinWx,0
11,wxmissoula,0
22,lastbestbox,0
24,ShannonAMay,0


In [24]:
print "DATA CHECK:"
print len(df.columns), "variables"
print len(df), "tweets"
print "All good:", len(df.columns)==44 and len(df)==2187

DATA CHECK:
44 variables
2187 tweets
All good: True


In [25]:
print len(df[df['retweet_dummy'] == 1]), "original tweets were retweeted"
print len(df[df['retweet_dummy'] == 0]), "original tweets were never retweeted"
print '\n', df.retweet_dummy.describe()

667 original tweets were retweeted
1520 original tweets were never retweeted

count    2187.000000
mean        0.304984
std         0.460505
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: retweet_dummy, dtype: float64


---
<br>
## Create dummies based on url and mentions count
<br>
    <ol>
      <li>entities_mentions_count ==> mention_dummy</li>
      <li>entities_urls_count ==> url_dummy</li>
    </ol>
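The mapping in the list above can be sketched as a small loop over the count columns. This is an illustrative Python 3 version on toy data, not the notebook's own code: it uses `> 0` instead of `np.where(... == 0, 0, 1)`, which gives the same result for non-negative counts but would code a missing count as 0 rather than 1.

```python
import pandas as pd

# Toy counts for three tweets (made-up values).
toy = pd.DataFrame({
    'entities_mentions_count': [0, 1, 3],
    'entities_urls_count': [2, 0, 0],
})

# Source count column -> name of the binary column derived from it.
count_to_dummy = {
    'entities_mentions_count': 'mention_dummy',
    'entities_urls_count': 'url_dummy',
}

for count_col, dummy_col in count_to_dummy.items():
    # 1 when the tweet has at least one mention/url, 0 otherwise.
    toy[dummy_col] = (toy[count_col] > 0).astype(int)
```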

In [26]:
df['mention_dummy'] = np.where(df['entities_mentions_count']==0,0,1)
df['url_dummy'] = np.where(df['entities_urls_count']==0,0,1)

In [27]:
print "DATA CHECK:"
print len(df.columns), "variables"
print len(df), "tweets"
print "All good:", len(df.columns)==46 and len(df)==2187

DATA CHECK:
46 variables
2187 tweets
All good: True


---
<br>
## Variable Checks

>Now let's review all the variables we created
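Alongside eyeballing crosstabs, a check like this can be expressed as a boolean test. The sketch below (Python 3, toy data) uses a hypothetical helper, `dummy_matches_count`, that returns True only when a dummy is 0 exactly where its source count is 0.

```python
import pandas as pd

def dummy_matches_count(frame, count_col, dummy_col):
    """True when the dummy is 0 exactly where the count is 0, and 1 everywhere else."""
    expected = (frame[count_col] != 0).astype(int)
    return bool((frame[dummy_col] == expected).all())

# A correctly coded toy example...
toy = pd.DataFrame({'retweet_count': [0, 4, 1], 'retweet_dummy': [0, 1, 1]})
ok = dummy_matches_count(toy, 'retweet_count', 'retweet_dummy')

# ...and a deliberately miscoded one (first row should be 0).
miscoded = toy.assign(retweet_dummy=[1, 1, 1])
bad = dummy_matches_count(miscoded, 'retweet_count', 'retweet_dummy')
```

A check of this kind can be dropped into an `assert` so that a miscoded variable stops the notebook instead of passing unnoticed.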

In [28]:
df[['from_user_screen_name', 'mtot', 'kauffman', 'exception_mag', 'retweet_count', 'retweet_dummy', 'entities_mentions_count', 'mention_dummy', 'entities_urls_count', 'url_dummy']].head(1)

Unnamed: 0,from_user_screen_name,mtot,kauffman,exception_mag,retweet_count,retweet_dummy,entities_mentions_count,mention_dummy,entities_urls_count,url_dummy
9,LeonKauffman,0,1,0,4,1,0,0,0,0


In [29]:
df['retweet_count'].describe()

count    2187.000000
mean        2.380430
std         8.033481
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max       136.000000
Name: retweet_count, dtype: float64

In [30]:
len(df[df['from_user_screen_name'] == 'visitmontana'])

108

In [31]:
df['retweet_dummy'].value_counts()

0    1520
1     667
Name: retweet_dummy, dtype: int64

In [32]:
df['mention_dummy'].value_counts()

0    1550
1     637
Name: mention_dummy, dtype: int64

In [33]:
df['url_dummy'].value_counts()

1    1328
0     859
Name: url_dummy, dtype: int64

In [34]:
df['mtot'].value_counts() 

0    2079
1     108
Name: mtot, dtype: int64

In [35]:
# the retweet_dummy variable has only been assigned 0 in cases where retweet_count was actually 0
pd.crosstab(df['retweet_count'], df['retweet_dummy'])

retweet_count,retweet_dummy=0,retweet_dummy=1
0,1520,0
1,0,219
2,0,98
3,0,63
4,0,40
5,0,32
6,0,24
7,0,20
8,0,14
9,0,9


In [36]:
pd.crosstab(df['entities_mentions_count'], df['mention_dummy'])

entities_mentions_count,mention_dummy=0,mention_dummy=1
0,1550,0
1,0,531
2,0,53
3,0,23
4,0,21
5,0,7
6,0,2


In [37]:
pd.crosstab(df['entities_urls_count'], df['url_dummy'])

entities_urls_count,url_dummy=0,url_dummy=1
0,859,0
1,0,1315
2,0,13



<br>
## <div style='text-align: center; color: green'>Looks like everything checks out, moving on...</div>
<br>

<br>

---

# Create time variables

><p style='font-size: large'>This is similar to the time visualization in `2_tweets_retweets_by_time.ipynb`. This time, once the dataframe is indexed by a pandas datetime, we'll use that index to create variables.</p>

<br>
<h2> By day of week</h2>
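As a compact sketch of the weekday coding used in this part of the notebook, the example below (Python 3, recent pandas) derives the weekday number from a datetime column via the `.dt` accessor, without moving the index, and then builds one named dummy per day. The three timestamps are taken from values that appear in this dataset; the `reindex` step is an added assumption so that all seven columns exist even when some days are absent.

```python
import pandas as pd

stamps = pd.to_datetime(pd.Series([
    '2017-04-21 04:21:17',
    '2017-04-20 23:42:12',
    '2016-12-24 03:30:38',
]))

# 0 = Monday ... 6 = Sunday, the same coding as DatetimeIndex.weekday.
weekday = stamps.dt.dayofweek

day_names = ['monday', 'tuesday', 'wednesday', 'thursday',
             'friday', 'saturday', 'sunday']
weekday_name = weekday.map(lambda i: day_names[i])

# One 0/1 column per day; reindex guarantees all seven columns in calendar order.
dummies = (
    pd.get_dummies(weekday_name)
      .reindex(columns=day_names, fill_value=0)
      .astype(int)
)
```

Naming the columns before calling `get_dummies` avoids a separate renaming step from 0-6 to day names.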

In [38]:
import calendar

In [39]:
df.dtypes[8:13]

retweeted_status         object
created_at_text          object
created_at               object
content                  object
from_user_screen_name    object
dtype: object

In [40]:
# convert to pandas datetime format
df['created_at_text'] = pd.to_datetime(df['created_at_text'])

In [41]:
df.dtypes[8:13]

retweeted_status                 object
created_at_text          datetime64[ns]
created_at                       object
content                          object
from_user_screen_name            object
dtype: object

In [42]:
df.created_at_text.describe()

count                    2187
unique                   2179
top       2017-04-05 05:20:08
freq                        2
first     2016-12-24 03:30:38
last      2017-04-21 04:21:17
Name: created_at_text, dtype: object

In [43]:
# set the index to the time variable
df = df.set_index(['created_at_text'])
print len(df)

2187


In [44]:
df['weekday'] = df.index.weekday # create a 'weekday' variable from the datetime-indexed df (0 = Monday)

In [45]:
print len(df['weekday'])

2187


In [46]:
# check the length & notice the weekday variable now in the dataset (0 = Monday)
# the index stays on created_at_text for now; uncomment the next line to return it to tweet_id
#df = df.set_index(['tweet_id'])
print len(df)
df.head(1)

2187


created_at_text,id,hash_select,tweet_id,inserted_date,truncated,language,possibly_sensitive,coordinates,retweeted_status,created_at,content,from_user_screen_name,from_user_id,from_user_followers_count,from_user_friends_count,from_user_listed_count,from_user_statuses_count,from_user_description,from_user_location,from_user_created_at,retweet_count,entities_urls,entities_urls_count,entities_hashtags,entities_hashtags_count,entities_mentions,entities_mentions_count,in_reply_to_screen_name,in_reply_to_status_id,source,entities_expanded_urls,json_output,entities_media_count,media_expanded_url,media_url,media_type,video_link,photo_link,twitpic,mtot,exception_mag,kauffman,retweet_dummy,mention_dummy,url_dummy,weekday
2017-04-21 04:21:17,2555,%2523montanamoment,8.55e+17,35:57.3,0,en,0.0,,,21:17.0,The night sky is a fascinating place especiall...,LeonKauffman,425125748.0,673,152,39,1616,"Hydrologist, photographer, fan of Drexel baske...","Condon, Montana, USA",Wed Nov 30 15:58:25 +0000 2011,4,,0,MontanaMoment,1,,0,,,"<a href=""https://about.twitter.com/products/tw...",,"{u'contributors': None, u'truncated': False, u...",1.0,https://twitter.com/LeonKauffman/status/855275...,http://pbs.twimg.com/media/C96LaxcUQAEMOQl.jpg,photo,0,0,0,0,0,1,1,0,0,4


In [47]:
df.weekday.describe()

count    2187.000000
mean        3.085048
std         1.975628
min         0.000000
25%         1.000000
50%         3.000000
75%         5.000000
max         6.000000
Name: weekday, dtype: float64

In [48]:
# create dummy variables for each of the 7 unique values of the weekday variable and add them to the df
# df2 = df.join(pd.get_dummies(df['weekday']))
df = pd.concat([df, pd.get_dummies(df['weekday'])], axis=1); df.head(1)

created_at_text,id,hash_select,tweet_id,inserted_date,truncated,language,possibly_sensitive,coordinates,retweeted_status,created_at,content,from_user_screen_name,from_user_id,from_user_followers_count,from_user_friends_count,from_user_listed_count,from_user_statuses_count,from_user_description,from_user_location,from_user_created_at,retweet_count,entities_urls,entities_urls_count,entities_hashtags,entities_hashtags_count,entities_mentions,entities_mentions_count,in_reply_to_screen_name,in_reply_to_status_id,source,entities_expanded_urls,json_output,entities_media_count,media_expanded_url,media_url,media_type,video_link,photo_link,twitpic,mtot,exception_mag,kauffman,retweet_dummy,mention_dummy,url_dummy,weekday,0,1,2,3,4,5,6
2017-04-21 04:21:17,2555,%2523montanamoment,8.55e+17,35:57.3,0,en,0.0,,,21:17.0,The night sky is a fascinating place especiall...,LeonKauffman,425125748.0,673,152,39,1616,"Hydrologist, photographer, fan of Drexel baske...","Condon, Montana, USA",Wed Nov 30 15:58:25 +0000 2011,4,,0,MontanaMoment,1,,0,,,"<a href=""https://about.twitter.com/products/tw...",,"{u'contributors': None, u'truncated': False, u...",1.0,https://twitter.com/LeonKauffman/status/855275...,http://pbs.twimg.com/media/C96LaxcUQAEMOQl.jpg,photo,0,0,0,0,0,1,1,0,0,4,0,0,0,0,1,0,0


In [49]:
print len(df)

2187


In [50]:
df = df.rename(columns={0: 'monday', 1: 'tuesday', 2: 'wednesday', 3: 'thursday', 4: 'friday', 5: 'saturday', 6: 'sunday'})
print len(df.columns)
df.head(1)

53


created_at_text,id,hash_select,tweet_id,inserted_date,truncated,language,possibly_sensitive,coordinates,retweeted_status,created_at,content,from_user_screen_name,from_user_id,from_user_followers_count,from_user_friends_count,from_user_listed_count,from_user_statuses_count,from_user_description,from_user_location,from_user_created_at,retweet_count,entities_urls,entities_urls_count,entities_hashtags,entities_hashtags_count,entities_mentions,entities_mentions_count,in_reply_to_screen_name,in_reply_to_status_id,source,entities_expanded_urls,json_output,entities_media_count,media_expanded_url,media_url,media_type,video_link,photo_link,twitpic,mtot,exception_mag,kauffman,retweet_dummy,mention_dummy,url_dummy,weekday,monday,tuesday,wednesday,thursday,friday,saturday,sunday
2017-04-21 04:21:17,2555,%2523montanamoment,8.55e+17,35:57.3,0,en,0.0,,,21:17.0,The night sky is a fascinating place especiall...,LeonKauffman,425125748.0,673,152,39,1616,"Hydrologist, photographer, fan of Drexel baske...","Condon, Montana, USA",Wed Nov 30 15:58:25 +0000 2011,4,,0,MontanaMoment,1,,0,,,"<a href=""https://about.twitter.com/products/tw...",,"{u'contributors': None, u'truncated': False, u...",1.0,https://twitter.com/LeonKauffman/status/855275...,http://pbs.twimg.com/media/C96LaxcUQAEMOQl.jpg,photo,0,0,0,0,0,1,1,0,0,4,0,0,0,0,1,0,0


In [51]:
len(df)

2187

<h2> By period of day: Morn, Day, Even, Night</h2>
<br>
<br>
>As discussed in the paper, while the Twitter timestamp is in UTC, we'll use local (Mountain) time for the analysis. Since we're dividing the day into four six-hour blocks and have to account for a six-hour time difference, we'll simply code the variables so that each lags the UTC block by one category. In other words, what would be `night` in UTC will be `evening` in local time.
<br>
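The one-category lag described above can be sketched in a single step: subtract six hours from the UTC hour, wrap around midnight, and integer-divide by six to get the local block. This is an illustrative Python 3 example on toy hours, assuming the same fixed six-hour offset as the notebook (a fixed offset does not follow daylight-saving changes); the label order is an assumption chosen to match the block definitions in this section.

```python
import numpy as np
import pandas as pd

# Toy UTC hours, one from each six-hour UTC block.
utc_hour = pd.Series([4, 9, 14, 23])

# Shift to local time with a fixed -6 hour offset, wrapping around midnight.
local_hour = (utc_hour - 6) % 24

# Local blocks: 0-5 night, 6-11 morning, 12-17 day, 18-23 evening.
labels = np.array(['night', 'morning', 'day', 'evening'])
period = pd.Series(labels[(local_hour // 6).to_numpy()], index=utc_hour.index)

# One 0/1 column per period, in a fixed column order.
period_dummies = (
    pd.get_dummies(period)
      .reindex(columns=['morning', 'day', 'evening', 'night'], fill_value=0)
      .astype(int)
)
```

Because every hour falls into exactly one block, the four dummies produced this way sum to 1 on every row.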

In [52]:
df['hour'] = df.index.hour # create a variable whose value equals the hour in which a tweet was sent
df.hour.describe()

count    2187.000000
mean       12.089163
std         7.495400
min         0.000000
25%         4.000000
50%        14.000000
75%        18.000000
max        23.000000
Name: hour, dtype: float64

In [53]:
len(df)

2187

><h3> Evening Variable</h3>

In [54]:
# assign to the var evening the value 1 if the UTC hour is between (and including) 0 and 5, assign 0 otherwise
# This accounts for the difference between UTC & local time (-6 hours). As a UTC variable, this would be 'night'
df['evening'] = np.where((df['hour'] >= 0) & (df['hour'] <= 5), 1, 0)

In [55]:
df.evening.describe()

count    2187.000000
mean        0.307270
std         0.461473
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: evening, dtype: float64

In [56]:
# zero cases of evening when hour > 5 (True), 672 cases when hour <= 5 (False)
pd.crosstab(df['hour']>5, df['evening'])

hour > 5,evening=0,evening=1
False,0,672
True,1515,0


In [57]:
df[['hour', 'evening']].head(5)

created_at_text,hour,evening
2017-04-21 04:21:17,4,1
2017-04-21 03:15:27,3,1
2017-04-21 03:15:25,3,1
2017-04-20 23:42:12,23,0
2017-04-20 23:26:55,23,0


><h3>Night variable</h3>

In [58]:
# assign to the var night the value 1 if the hour is between (and including) 6am and 11am, assign 0 otherwise
# again, this accounts for the difference between UTC & local time. As a UTC variable, this would be 'morning'
df['night'] = np.where((df['hour'] >= 6) & (df['hour'] <= 11), 1, 0)


In [59]:
print df.night.describe()

count    2187.000000
mean        0.096479
std         0.295314
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: night, dtype: float64


In [60]:
# night is never 1 when hour is less than 6 or greater than 11 (True row); all 211 cases fall between 6 and 11
pd.crosstab((df['hour']< 6) | (df['hour'] >11), df['night'])

(hour < 6) | (hour > 11),night=0,night=1
False,0,211
True,1976,0


><h3>Morning variable</h3>

In [61]:
# assign to the var morning the value 1 if the hour is between (and including) 12pm and 5pm, assign 0 otherwise
df['morning'] = np.where((df['hour'] >= 12) & (df['hour'] <= 17), 1, 0)

In [62]:
df.morning.describe()

count    2187.000000
mean        0.293553
std         0.455493
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: morning, dtype: float64

In [63]:
# morning is never 1 when hour is less than 12 or greater than 17 (True row); all 642 cases fall between 12 and 17
pd.crosstab((df['hour']< 12) | (df['hour'] >17), df['morning'])

(hour < 12) | (hour > 17),morning=0,morning=1
False,0,642
True,1545,0


>### Day variable

In [64]:
# assign to the var day the value 1 if the hour is 18 (6pm) or later, assign 0 otherwise (hour var goes up to 23, 11pm)
df['day'] = np.where((df['hour'] >= 18), 1, 0)

In [65]:
df.day.describe()

count    2187.000000
mean        0.302698
std         0.459529
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: day, dtype: float64

In [66]:
# day is never 1 when hour is less than 18 (True row); all 662 cases fall at 18 or later
pd.crosstab((df['hour'] < 18), df['day'])

hour < 18,day=0,day=1
False,0,662
True,1525,0


In [67]:
df[['morning', 'day', 'evening', 'night']].head(3)

created_at_text,morning,day,evening,night
2017-04-21 04:21:17,0,0,1,0
2017-04-21 03:15:27,0,0,1,0
2017-04-21 03:15:25,0,0,1,0


In [68]:
print len(df)
print len(df.columns)

2187
58


# Standardize variables


In [69]:
# now that we're done making time-dependent variables, we'll move the index from the timestamp to the account screen name

In [70]:
df = df.set_index(['from_user_screen_name'])

In [71]:
# let's make a new df with the variables we'll be working with from now on
df2 = df[['retweet_count','retweet_dummy', 'mtot', 'exception_mag', 'kauffman',  'mention_dummy', 'url_dummy', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday', 'morning', 'day', 'evening', 'night', 'from_user_followers_count', 'from_user_friends_count', 'from_user_listed_count', 'from_user_statuses_count', 'entities_mentions_count']]
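# drop the suspected bot account; .copy() keeps the later column assignments from triggering SettingWithCopyWarning
df2 = df2[df2['exception_mag'] != 1].copy()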
print len(df2), "tweets"
print len(df2.columns)
df2.head(10)

1869 tweets
23


from_user_screen_name,retweet_count,retweet_dummy,mtot,exception_mag,kauffman,mention_dummy,url_dummy,monday,tuesday,wednesday,thursday,friday,saturday,sunday,morning,day,evening,night,from_user_followers_count,from_user_friends_count,from_user_listed_count,from_user_statuses_count,entities_mentions_count
LeonKauffman,4,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,673,152,39,1616,0
ErinWx,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,632,243,47,10327,0
wxmissoula,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1550,6,55,10450,0
lastbestbox,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,186,741,4,284,0
ShannonAMay,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1128,766,80,11938,1
UNLV_Sage,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1009,2089,21,3447,0
RMKK,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,217,98,20,4280,0
ShannonAMay,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1128,766,80,11938,1
CWPsPhotos,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,384,552,41,680,0
visitmontana,26,1,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,59595,329,738,11502,0


In [72]:
# default axis is 0, i.e. one mean per column
df2.mean()

retweet_count                   2.785447
retweet_dummy                   0.356875
mtot                            0.057785
exception_mag                   0.000000
kauffman                        0.054040
mention_dummy                   0.177100
url_dummy                       0.540396
monday                          0.134831
tuesday                         0.125736
wednesday                       0.142857
thursday                        0.153023
friday                          0.149813
saturday                        0.148208
sunday                          0.145532
morning                         0.299090
day                             0.313002
evening                         0.317817
night                           0.070091
from_user_followers_count    7928.935259
from_user_friends_count      2176.244516
from_user_listed_count        193.727662
from_user_statuses_count     9238.946495
entities_mentions_count         0.284109
dtype: float64

In [73]:
# df2.to_csv('aaa_test.csv', sep=',')

In [74]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [75]:
np.round(df2.describe(), 2).T[['count','mean', 'std', 'min', 'max']]

Unnamed: 0,count,mean,std,min,max
retweet_count,1869.0,2.79,8.63,0.0,136.0
retweet_dummy,1869.0,0.36,0.48,0.0,1.0
mtot,1869.0,0.06,0.23,0.0,1.0
exception_mag,1869.0,0.0,0.0,0.0,0.0
kauffman,1869.0,0.05,0.23,0.0,1.0
mention_dummy,1869.0,0.18,0.38,0.0,1.0
url_dummy,1869.0,0.54,0.5,0.0,1.0
monday,1869.0,0.13,0.34,0.0,1.0
tuesday,1869.0,0.13,0.33,0.0,1.0
wednesday,1869.0,0.14,0.35,0.0,1.0


In [76]:
cols_to_norm = ['from_user_followers_count','from_user_friends_count', 'from_user_listed_count', 'from_user_statuses_count', 'entities_mentions_count']
df2[cols_to_norm] = df2[cols_to_norm].apply(lambda x: (x - x.mean()) / (x.std()))
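The standardization applied here is a z-score: each value has its column mean subtracted and is divided by the column standard deviation (pandas uses the sample standard deviation, `ddof=1`, by default). The sketch below is a Python 3 illustration on toy values showing that the transformed column ends up with mean 0 and standard deviation 1, while columns outside `cols_to_norm` are left untouched.

```python
import pandas as pd

# Toy data (made-up values): one continuous column and one dummy.
toy = pd.DataFrame({
    'from_user_followers_count': [100.0, 200.0, 600.0],
    'mtot': [0, 1, 0],
})
cols_to_norm = ['from_user_followers_count']

# z-score: (x - mean) / std, computed column by column.
standardized = toy.copy()
standardized[cols_to_norm] = (
    (toy[cols_to_norm] - toy[cols_to_norm].mean()) / toy[cols_to_norm].std()
)

col = standardized['from_user_followers_count']
col_mean, col_std = col.mean(), col.std()
```

After this step, coefficients on the standardized variables are read in units of standard deviations rather than raw counts, which is why a non-standardized record of follower counts is worth keeping separately.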

In [77]:
np.round(df2.describe(), 2).T[['count','mean', 'std', 'min', 'max']]

Unnamed: 0,count,mean,std,min,max
retweet_count,1869.0,2.79,8.63,0.0,136.0
retweet_dummy,1869.0,0.36,0.48,0.0,1.0
mtot,1869.0,0.06,0.23,0.0,1.0
exception_mag,1869.0,0.0,0.0,0.0,0.0
kauffman,1869.0,0.05,0.23,0.0,1.0
mention_dummy,1869.0,0.18,0.38,0.0,1.0
url_dummy,1869.0,0.54,0.5,0.0,1.0
monday,1869.0,0.13,0.34,0.0,1.0
tuesday,1869.0,0.13,0.33,0.0,1.0
wednesday,1869.0,0.14,0.35,0.0,1.0


In [78]:
# np.round(df2.describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_csv('aaa_HT_SUM_STATS_final_vars_less_manual.csv', sep=',')

In [79]:
# df2.to_pickle('a_HT_no_retweets_FINAL_VARIABLES_less_manual.pkl')

In [80]:
# save the final variables dataset in csv format to make the dataset portable
# df2.to_csv('HT_orig_less_manual.csv', sep=',')

---
<br>
><br><p style='font-size: large;'>That's it. Now that we've obtained the variables we can in Python, the next step is to hand-code the account-type variables used in the study. Together with the `mtot`, `exception_mag` and `kauffman` variables, this series will be mutually exclusive and jointly exhaustive.</p><br>
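Once the hand-coded columns exist, the "mutually exclusive and jointly exhaustive" property can be checked mechanically: every tweet should have exactly one account-type dummy equal to 1. The Python 3 sketch below uses toy data; `other_org` and `bus_int_ind` are the category names mentioned earlier in this notebook, and the full list of hand-coded categories is an assumption here.

```python
import pandas as pd

# Toy tweet-level data: each row should belong to exactly one account type.
toy = pd.DataFrame({
    'mtot':        [1, 0, 0, 0],
    'kauffman':    [0, 1, 0, 0],
    'other_org':   [0, 0, 1, 0],
    'bus_int_ind': [0, 0, 0, 1],
})
type_cols = ['mtot', 'kauffman', 'other_org', 'bus_int_ind']

# A row total of 1 means exactly one category applies:
# totals above 1 violate exclusivity, totals of 0 violate exhaustiveness.
row_totals = toy[type_cols].sum(axis=1)
is_partition = bool((row_totals == 1).all())

# A deliberately broken variant: the last tweet has no category at all.
broken = toy.copy()
broken.loc[3, 'bus_int_ind'] = 0
broken_is_partition = bool((broken[type_cols].sum(axis=1) == 1).all())
```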
