# **Compiling Data with MongoDB**

## Create Databases

Start MongoDB. Two JSON files (`tweets2.json` and `tweets4.json`) need to be imported into the database, which can be done with the terminal commands below:

```bash
mongoimport --db tweets --collection tweets2 --file tweets2.json  

mongoimport --db tweets --collection tweets4 --file tweets4.json
```

The `tweets` database has two collections: `tweets2` and `tweets4`.

## Filter Through Tweets Database

Not all tweets in the database are desired for this project. Specifically, we are not concerned with the following:

1. **Retweets**
2. **Tweets without location info**
    - We want to analyze geographical trends, so tweets without a location attached to them are not desirable.
    
Queries need to filter out these tweets: all retweets are omitted, and the location-related fields (`user.location`, `geo.coordinates`, `place.full_name`, `place.country`, and `place.country_code`) are retrieved so they can be evaluated. The queries also keep only English-language tweets (`lang` equal to `en`).
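If the location requirement were enforced at query time, one option might be an `$or` clause that keeps a tweet when at least one location field is populated. This is only a sketch, not the filter this notebook runs: `location_filter` and `has_location` are hypothetical names, and `has_location` is a plain-Python mirror of the `$or` rule so it can be checked on sample documents without a database.

```python
# Hypothetical filter: keep non-retweets that carry at least one location field.
location_filter = {
    'retweeted_status': {'$exists': False},
    '$or': [
        {'user.location': {'$nin': [None, '']}},  # profile location filled in
        {'geo': {'$ne': None}},                   # exact coordinates attached
        {'place': {'$ne': None}},                 # tagged place attached
    ],
}

def has_location(tweet):
    """Plain-Python mirror of the $or clause, for checking sample documents."""
    user_location = (tweet.get('user') or {}).get('location')
    return (bool(user_location)
            or tweet.get('geo') is not None
            or tweet.get('place') is not None)

# Illustrative documents only (not taken from the collections).
sample = [
    {'user': {'location': ''}, 'geo': None, 'place': None},
    {'user': {'location': 'Colorado, USA'}, 'geo': None, 'place': None},
    {'user': {'location': ''}, 'geo': None, 'place': {'country_code': 'PH'}},
]
print([has_location(t) for t in sample])  # [False, True, True]
```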

___
- Fields that will be returned by the query
    - created_at
    - id
    - full_text
    - entities.hashtags
    - user.name
    - user.screen_name
    - user.location
    - user.followers_count
    - user.friends_count
    - user.created_at
    - user.verified
    - user.statuses_count
    - geo.coordinates
    - place.full_name
    - place.country
    - place.country_code
    - retweet_count
    - favorite_count
    - lang
___
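Because the same field list applies to both collections, the projection could be generated from a single list instead of being typed out twice. A minimal sketch, assuming the hypothetical names `FIELDS`, `projection`, and `query_filter` (none of them appear in the notebook):

```python
# Hypothetical helper names; the field names are the ones listed above.
FIELDS = [
    'created_at', 'id', 'full_text', 'entities.hashtags',
    'user.name', 'user.screen_name', 'user.location',
    'user.followers_count', 'user.friends_count', 'user.created_at',
    'user.verified', 'user.statuses_count',
    'geo.coordinates',
    'place.full_name', 'place.country', 'place.country_code',
    'retweet_count', 'favorite_count', 'lang',
]

# MongoDB projection document: 1 means "include this field".
projection = {field: 1 for field in FIELDS}

# Filter document: skip retweets and keep English-language tweets.
query_filter = {'retweeted_status': {'$exists': False}, 'lang': 'en'}

print(len(projection))  # 19
```

Both dictionaries could then be passed to `find`, for example `db.tweets2.find(query_filter, projection)`.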

In [2]:
import pandas as pd
import numpy as np
import pickle
import datetime
import re

from pymongo import MongoClient
from pprint import pprint

In [3]:
client = MongoClient()

In [4]:
db = client.tweets

In [5]:
# Verify that tweets2 and tweets4 collections have been successfully
# uploaded to the database.
# collection_names() is deprecated in PyMongo; list_collection_names() replaces it.
db.list_collection_names()


['tweets2', 'tweets4']

### Query for `tweets2` collection

In [6]:
tweets2_cursor = db.tweets2.find({
                                'retweeted_status': {'$exists': False},
#                                 'geo': {'$ne': None},
#                                 'user.location': {'$ne': None},
#                                 'place.country_code': 'US',
                                'lang': 'en'},
                               {'created_at': 1, 
                                'id': 1,
                                'full_text': 1,
                                'entities.hashtags': 1,
                                'user.name': 1,
                                'user.screen_name': 1,
                                'user.location': 1,
                                'user.followers_count': 1,
                                'user.friends_count': 1,
                                'user.created_at': 1,
                                'user.verified': 1,
                                'user.statuses_count': 1,
                                'geo.coordinates': 1,
                                'place.full_name': 1,
                                'place.country': 1,
                                'place.country_code': 1,
                                'retweet_count': 1,
                                'favorite_count': 1,
                                'lang': 1}
                              )

### Query for `tweets4` collection

In [7]:
tweets4_cursor = db.tweets4.find({
                                'retweeted_status': {'$exists': False},
#                                 'geo': {'$ne': None},
#                                 'user.location': {'$ne': None},
#                                 'place.country_code': 'US',
                                'lang': 'en'},
                               {'created_at': 1, 
                                'id': 1,
                                'full_text': 1,
                                'entities.hashtags': 1,
                                'user.name': 1,
                                'user.screen_name': 1,
                                'user.location': 1,
                                'user.followers_count': 1,
                                'user.friends_count': 1,
                                'user.created_at': 1,
                                'user.verified': 1,
                                'user.statuses_count': 1,
                                'geo.coordinates': 1,
                                'place.full_name': 1,
                                'place.country': 1,
                                'place.country_code': 1,
                                'retweet_count': 1,
                                'favorite_count': 1,
                                'lang': 1}
                              )

### Create `tweets_df` from cursors

In [8]:
def df_generator(cursor_1, cursor_2):
   
    '''
    Takes in two cursors, converts them into dataframes, then
    appends the second dataframe to the first dataframe.
    '''
    
    df_1 = pd.DataFrame(list(cursor_1))
    df_2 = pd.DataFrame(list(cursor_2))
    
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
    df_combined = pd.concat([df_1, df_2])
    
    return df_combined

In [9]:
tweets_df = df_generator(tweets2_cursor, tweets4_cursor)

In [10]:
tweets_df = tweets_df.reset_index(drop=True)

In [11]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4225416 entries, 0 to 4225415
Data columns (total 11 columns):
_id               object
created_at        object
entities          object
favorite_count    int64
full_text         object
geo               object
id                int64
lang              object
place             object
retweet_count     int64
user              object
dtypes: int64(3), object(8)
memory usage: 354.6+ MB


# Clean Up `tweets_df`

Because the dataframe was built from the MongoDB cursors, some of the columns (specifically "entities", "user", and "place") hold nested documents as dictionaries. The data needs to be extracted from these columns so that it is easily accessible when using `tweets_df` later on.

A list will be created for each field of interest from these columns. They will then be organized into a dictionary, converted into a dataframe, and added to `tweets_df`.

Additional cleaning will be performed after that, such as:
- Deleting redundant and/or unnecessary columns
- Converting date columns to datetime and adding new columns
- Removing emojis from tweets

## Extract Hashtags from "Entities" Column
- Ultimately, we want to create a list of hashtags for each tweet (i.e. `hashtags_list` below)

In [12]:
hashtags_prelim_list = []

for index, rows in tweets_df['entities'].items():
    for key, value in rows.items():
        if key == 'hashtags':
            hashtags_prelim_list.append(value)

In [13]:
hashtags_spec_list = []
hashtags_list = []

for lists in hashtags_prelim_list:
    for i in lists:
        if isinstance(i, dict):
            for key, value in i.items():
                if key == 'text':
                    hashtags_spec_list.append(value)
    hashtags_list.append(hashtags_spec_list)
    hashtags_spec_list = []

In [14]:
len(hashtags_list) # Verify length of list = length of tweets_df

4225416
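The two extraction loops can also be expressed as one nested list comprehension. The sketch below runs on a small hand-made sample (the `indices` values are illustrative); unlike the loop version, it still emits an empty list when the `hashtags` key is absent, so the result stays aligned with the rows of the dataframe. `entities_column` is a hypothetical stand-in for `tweets_df['entities']`.

```python
# Sample 'entities' values shaped like those in tweets_df (indices are illustrative).
entities_column = [
    {'hashtags': []},
    {'hashtags': [{'text': 'climatechange', 'indices': [12, 26]}]},
    {'hashtags': [{'text': 'Islam', 'indices': [31, 37]},
                  {'text': 'Mercy', 'indices': [60, 66]}]},
    {},  # no 'hashtags' key at all
]

# One list of hashtag strings per tweet; tweets without hashtags get [].
hashtags_list = [
    [tag['text'] for tag in entities.get('hashtags', [])
     if isinstance(tag, dict) and 'text' in tag]
    for entities in entities_column
]

print(hashtags_list)  # [[], ['climatechange'], ['Islam', 'Mercy'], []]
```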

## Extract User Info from "User" Column

We are interested in the information from the following fields in the "user" column:
- user.name
- user.screen_name
- user.location
- user.followers_count
- user.friends_count
- user.created_at
- user.verified
- user.statuses_count 

In [15]:
name_list = []
screen_name_list = []
location_list = []
followers_count_list = []
friends_count_list = []
user_created_at_list = []
verified_list = []
statuses_count_list = []


for index, rows in tweets_df['user'].items():
    for key, value in rows.items():
        if key == 'name':
            name_list.append(value)
        elif key == 'screen_name':
            screen_name_list.append(value)
        elif key == 'location':
            location_list.append(value)
        elif key == 'followers_count':
            followers_count_list.append(value)
        elif key == 'friends_count':
            friends_count_list.append(value)
        elif key == 'created_at':
            user_created_at_list.append(value)
        elif key == 'verified':
            verified_list.append(value)
        elif key == 'statuses_count':
            statuses_count_list.append(value)
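The loop above appends a value only when a key is present, so a single user document missing one key would leave that list shorter than the others. A minimal sketch of a variant that always appends exactly one value per document, using `dict.get`; `USER_KEYS`, `extract_user_fields`, and the sample documents are hypothetical:

```python
# Keys of the embedded 'user' document that should become columns.
USER_KEYS = ['name', 'screen_name', 'location', 'followers_count',
             'friends_count', 'created_at', 'verified', 'statuses_count']

def extract_user_fields(user_docs):
    """Return {key: [values...]} with exactly one value per user document."""
    columns = {key: [] for key in USER_KEYS}
    for user in user_docs:
        for key in USER_KEYS:
            # .get() yields None when a key is absent, so lists never fall out of step.
            columns[key].append(user.get(key))
    return columns

# Illustrative documents with deliberately missing keys.
sample_users = [
    {'name': 'Ian James', 'screen_name': 'ByIanJames', 'verified': True},
    {'name': 'Nita Cosby', 'screen_name': '5_2blue', 'location': 'Houston, TX'},
]
columns = extract_user_fields(sample_users)
print(columns['location'])  # [None, 'Houston, TX']
```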

## Extract Place Info from "Place" Column

We are interested in the information from the following fields in the "place" column:
- place.full_name
- place.country
- place.country_code

In [16]:
place_name_list = []
place_country_list = []
place_country_code_list = []


for index, rows in tweets_df['place'].items():
    if isinstance(rows, dict):
        for key, value in rows.items():
            if key == 'full_name':
                place_name_list.append(value)
            elif key == 'country':
                place_country_list.append(value)
            elif key == 'country_code':
                place_country_code_list.append(value)
    else:
        place_name_list.append(np.nan)
        place_country_list.append(np.nan)
        place_country_code_list.append(np.nan)   
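Since the fields of interest are written as dotted paths (for example `place.country_code`), a small lookup helper can walk those paths directly and fall back to NaN when the parent is missing or a key is absent. This is a sketch only; `get_nested` is a hypothetical helper, not part of pandas or PyMongo.

```python
import math

def get_nested(doc, dotted_path, default=math.nan):
    """Follow a dotted path such as 'place.country_code' through nested dicts."""
    current = doc
    for part in dotted_path.split('.'):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current

# Illustrative tweets: one with a partial place document, one with none.
tweet = {'place': {'full_name': 'Iloilo City, Western Visayas', 'country_code': 'PH'}}
no_place = {'place': None}

print(get_nested(tweet, 'place.country_code'))  # PH
print(get_nested(no_place, 'place.full_name'))  # nan
```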

## Create supplemental dataframe, add to `tweets_df`, and perform more cleaning

In [17]:
tweet_supp_dict = {'Hashtags_List': hashtags_list,
                   'User_Name': name_list,
                   'User_Screen_Name': screen_name_list,
                   'User_Status_Count': statuses_count_list,
                   'User_Followers_Count': followers_count_list,
                   'User_Friends_Count': friends_count_list,
                   'User_Verified_Status': verified_list,
                   'User_Account_Start_Date': user_created_at_list,
                   'User_Location': location_list,
                   'Tweet_Location': place_name_list,
                   'Tweet_Location_Country': place_country_list,
                   'Tweet_Location_Country_Code': place_country_code_list}

tweet_supp_df = pd.DataFrame(tweet_supp_dict)

In [18]:
tweet_supp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4225416 entries, 0 to 4225415
Data columns (total 12 columns):
Hashtags_List                  object
User_Name                      object
User_Screen_Name               object
User_Status_Count              int64
User_Followers_Count           int64
User_Friends_Count             int64
User_Verified_Status           bool
User_Account_Start_Date        object
User_Location                  object
Tweet_Location                 object
Tweet_Location_Country         object
Tweet_Location_Country_Code    object
dtypes: bool(1), int64(3), object(8)
memory usage: 358.6+ MB


In [19]:
tweets_full_df = pd.merge(tweets_df, tweet_supp_df, on=tweets_df.index)

In [20]:
tweets_full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4225416 entries, 0 to 4225415
Data columns (total 24 columns):
key_0                          int64
_id                            object
created_at                     object
entities                       object
favorite_count                 int64
full_text                      object
geo                            object
id                             int64
lang                           object
place                          object
retweet_count                  int64
user                           object
Hashtags_List                  object
User_Name                      object
User_Screen_Name               object
User_Status_Count              int64
User_Followers_Count           int64
User_Friends_Count             int64
User_Verified_Status           bool
User_Account_Start_Date        object
User_Location                  object
Tweet_Location                 object
Tweet_Location_Country         object
Tweet_Location_Country_Code    object

In [21]:
tweets_full_df.head(10)

Unnamed: 0,key_0,_id,created_at,entities,favorite_count,full_text,geo,id,lang,place,...,User_Screen_Name,User_Status_Count,User_Followers_Count,User_Friends_Count,User_Verified_Status,User_Account_Start_Date,User_Location,Tweet_Location,Tweet_Location_Country,Tweet_Location_Country_Code
0,0,5e27ac4ae88fe0ef3b045b88,Tue Nov 27 02:45:12 +0000 2018,{'hashtags': []},0,@DRUDGE_REPORT Climate change is not about cli...,,1067247943491702784,en,,...,All4Feedom,1176,51,33,False,Thu May 24 05:06:44 +0000 2018,,,,
1,1,5e27ac4ae88fe0ef3b045b89,Tue Nov 27 02:45:11 +0000 2018,{'hashtags': []},3,@juliabanksmp The statement clearly put the vi...,,1067247939439943680,en,,...,Zwickyhumason1,577,11,48,False,Fri Oct 14 19:01:14 +0000 2016,"Zwettl-Lower Austria, Austria",,,
2,2,5e27ac4ae88fe0ef3b045b8a,Tue Nov 27 02:45:12 +0000 2018,{'hashtags': []},0,BBC News - Trump on climate change report: 'I ...,,1067247941960626176,en,,...,rbmumsie,118235,952,609,False,Sun Apr 21 23:44:26 +0000 2013,"Colorado, USA",,,
3,3,5e27ac4ae88fe0ef3b045b8c,Tue Nov 27 02:45:12 +0000 2018,"{'hashtags': [{'text': 'climatechange', 'indic...",0,The info on #climatechange the tRump regime di...,,1067247940094222342,en,,...,booksanescape,83238,12482,13600,False,Thu Jun 13 03:15:34 +0000 2013,"Cali Girl, USA",,,
4,4,5e27ac4ae88fe0ef3b045b96,Tue Nov 27 02:45:16 +0000 2018,"{'hashtags': [{'text': 'climatechange', 'indic...",0,.Gov. Defensor of Iloilo: “I have no idea abou...,,1067247957508911105,en,"{'full_name': 'Iloilo City, Western Visayas', ...",...,icleiseas,2718,996,386,False,Fri Dec 13 09:50:27 +0000 2013,"Manila, Philippines","Iloilo City, Western Visayas",Republic of the Philippines,PH
5,5,5e27ac4ae88fe0ef3b045b9a,Tue Nov 27 02:45:17 +0000 2018,{'hashtags': []},0,if you start your sentence with I'm not a sci...,,1067247961434861568,en,,...,i_sharyn,8062,27,194,False,Tue Jul 03 00:55:41 +0000 2018,,,,
6,6,5e27ac4ae88fe0ef3b045ba3,Tue Nov 27 02:45:18 +0000 2018,{'hashtags': []},0,"Not Buffoonery, rather pure ignorance. https:/...",,1067247967361556480,en,,...,RobertLongView,4464,27,108,False,Thu Sep 06 02:27:46 +0000 2012,,,,
7,7,5e27ac4ae88fe0ef3b045ba4,Tue Nov 27 02:45:18 +0000 2018,{'hashtags': []},7,Sen. Mike Lee gives a perfectly ridiculous rea...,,1067247966363238400,en,,...,thinkprogress,152355,826870,892,True,Thu Jul 09 20:42:08 +0000 2009,"Washington, D.C.",,,
8,8,5e27ac4ae88fe0ef3b045ba5,Tue Nov 27 02:45:21 +0000 2018,{'hashtags': []},0,"@CNN Well, maybe the Global warming might even...",,1067247979118120960,en,,...,PaulusMcNaulus,5059,6,53,False,Sun Apr 24 15:20:14 +0000 2016,,,,
9,9,5e27ac4ae88fe0ef3b045bbf,Tue Nov 27 02:45:26 +0000 2018,{'hashtags': []},1,Add...\nA bunch of government scientists lande...,,1067248002178404353,en,,...,AlamoOnTheRise,132668,26195,26169,False,Thu Apr 01 02:56:57 +0000 2010,"San Antonio, Texas",,,


### Delete Redundant/Unnecessary Columns

In [22]:
# Delete 'key_0' column since it is simply a duplicate of the index.

del tweets_full_df['key_0']

In [23]:
# Verify all tweets are in English. This column can be dropped
# as well, which is shown in the next cell.

tweets_full_df['lang'].unique()

array(['en'], dtype=object)

In [24]:
# 'entities', 'user', and 'place' columns can be deleted since all
# the desired info has been extracted from them. The id columns
# will be deleted as well.

tweets_full_df = tweets_full_df.drop(['_id', 'entities', 'id', 'user', 'lang', 'place'], axis = 1)


In [25]:
tweets_full_df['User_Location'].value_counts()

                                  966055
United States                      69830
Washington, DC                     40745
London                             36788
New York, NY                       35972
Australia                          33460
Globally l Planet Earth            32499
London, England                    32051
Canada                             30939
USA                                29918
Finland                            28460
Global                             26576
California, USA                    24574
Globally l Online l Earth          23026
Online l Globally l Earth          22793
United Kingdom                     20690
Los Angeles, CA                    19736
Earth                              18482
UK                                 18032
Sydney, New South Wales            16679
New York, USA                      15279
Florida, USA                       15199
Worldwide                          15086
Toronto, Ontario                   14467
Boston, MA      

In [26]:
# Replace blank 'User_Location' values with NaN's

tweets_full_df.User_Location = tweets_full_df.User_Location.replace({"": np.nan})

In [27]:
tweets_full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4225416 entries, 0 to 4225415
Data columns (total 17 columns):
created_at                     object
favorite_count                 int64
full_text                      object
geo                            object
retweet_count                  int64
Hashtags_List                  object
User_Name                      object
User_Screen_Name               object
User_Status_Count              int64
User_Followers_Count           int64
User_Friends_Count             int64
User_Verified_Status           bool
User_Account_Start_Date        object
User_Location                  object
Tweet_Location                 object
Tweet_Location_Country         object
Tweet_Location_Country_Code    object
dtypes: bool(1), int64(5), object(11)
memory usage: 552.1+ MB


### Change dates to datetime, insert User_Years_Active column

In [28]:
# Convert 'created_at' column to datetime

new_time_string = []
new_time_dt = []

# Strip the '+0000 ' UTC offset so the string matches the strptime format below
for index, rows in tweets_full_df['created_at'].items():
    new_time_string.append(rows.replace('+0000 ', ''))

# Parse each string, then round-trip through '%m/%d/%Y' to drop the time of day
for i in new_time_string:
    i_time = datetime.datetime.strptime(i, '%a %b %d %H:%M:%S %Y')
    i_time_str = datetime.datetime.strftime(i_time, '%m/%d/%Y')
    new_time_dt.append(datetime.datetime.strptime(i_time_str, '%m/%d/%Y'))

tweets_full_df['created_at'] = pd.Series(new_time_dt)

In [29]:
# Convert 'User_Account_Start_Date' column to datetime

new_time_string2 = []
new_time_dt2 = []

# Strip the '+0000 ' UTC offset so the string matches the strptime format below
for index, rows in tweets_full_df['User_Account_Start_Date'].items():
    new_time_string2.append(rows.replace('+0000 ', ''))

# Parse each string, then round-trip through '%m/%d/%Y' to drop the time of day
for i in new_time_string2:
    i_time2 = datetime.datetime.strptime(i, '%a %b %d %H:%M:%S %Y')
    i_time_str2 = datetime.datetime.strftime(i_time2, '%m/%d/%Y')
    new_time_dt2.append(datetime.datetime.strptime(i_time_str2, '%m/%d/%Y'))

tweets_full_df['User_Account_Start_Date'] = pd.Series(new_time_dt2)
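The same conversion can be done in one parsing step, because `strptime` understands the `+0000` offset through the `%z` directive. A minimal sketch, assuming the hypothetical names `TWITTER_FORMAT` and `parse_twitter_date`; it returns a naive datetime at midnight, which is what the two cells above produce.

```python
from datetime import datetime

# Layout of Twitter's 'created_at' strings, e.g. 'Tue Nov 27 02:45:12 +0000 2018'.
TWITTER_FORMAT = '%a %b %d %H:%M:%S %z %Y'

def parse_twitter_date(raw):
    """Parse a Twitter timestamp and keep only the calendar date (midnight, no tz)."""
    parsed = datetime.strptime(raw, TWITTER_FORMAT)
    return parsed.replace(tzinfo=None, hour=0, minute=0, second=0, microsecond=0)

print(parse_twitter_date('Tue Nov 27 02:45:12 +0000 2018'))  # 2018-11-27 00:00:00
```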

In [30]:
# Create years_active list

days_active = tweets_full_df['created_at'] - tweets_full_df['User_Account_Start_Date']

years_active = []

for index, rows in days_active.items():
    years_active.append(round(rows.days/365.25, 1))

In [31]:
tweets_full_df.insert(13, column = 'User_Years_Active', value = pd.Series(years_active))
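As a worked check of the `User_Years_Active` arithmetic, the calculation for a single tweet can be written as a small function. `years_active` here is a hypothetical function name; the formula (elapsed days divided by 365.25, rounded to one decimal) is the one used in the cell above.

```python
from datetime import datetime

def years_active(tweet_date, account_start_date):
    """Account age at tweet time, in years rounded to one decimal place."""
    elapsed_days = (tweet_date - account_start_date).days
    return round(elapsed_days / 365.25, 1)

# 2011-03-18 to 2017-10-23 is 2411 days, and 2411 / 365.25 is about 6.6.
print(years_active(datetime(2017, 10, 23), datetime(2011, 3, 18)))  # 6.6
```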


### Remove Emojis from Tweets

In [36]:
# Function to remove emojis

import emoji
import string

def give_emoji_free_text(text):
    
    '''
    Takes in a single tweet (string), drops every whitespace-separated token
    that contains an emoji, and returns the cleaned string.
    '''
    
    # Note: emoji.UNICODE_EMOJI is only available in older releases of the
    # emoji package; it was removed in version 2.0 (emoji.EMOJI_DATA replaces it).
    emoji_list = [c for c in text if c in emoji.UNICODE_EMOJI]
    clean_text = ' '.join([word for word in text.split() if not any(e in word for e in emoji_list)])

    return clean_text

In [38]:
full_text_list = []

for index, rows in tweets_full_df['full_text'].items():
    rows = give_emoji_free_text(rows) # remove emojis
    full_text_list.append(rows)
    
tweets_full_df['full_text'] = full_text_list
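Note that `give_emoji_free_text` discards the whole token whenever it contains an emoji, so a word with an emoji attached is lost too. If only the emoji characters should go, a rough standard-library approximation is to drop characters in Unicode category `So` ("Symbol, other"), which covers most emoji but also symbols such as © and °. This is a sketch under that assumption, not a replacement for the `emoji` package; `strip_symbol_characters` is a hypothetical name.

```python
import unicodedata

def strip_symbol_characters(text):
    """Drop characters in Unicode category 'So' and normalize whitespace."""
    kept = ''.join(ch for ch in text if unicodedata.category(ch) != 'So')
    return ' '.join(kept.split())

print(strip_symbol_characters('Save the planet \U0001F30D now'))  # Save the planet now
print(strip_symbol_characters('climate\U0001F30Dchange'))         # climatechange
```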

In [40]:
tweets_full_df.tail(10)

Unnamed: 0,created_at,favorite_count,full_text,geo,retweet_count,Hashtags_List,User_Name,User_Screen_Name,User_Status_Count,User_Followers_Count,User_Friends_Count,User_Verified_Status,User_Account_Start_Date,User_Years_Active,User_Location,Tweet_Location,Tweet_Location_Country,Tweet_Location_Country_Code
4225406,2017-10-23,4,"""We are in the unsustainable future."" #climate...",,5,"[climatechange, santarosafire, climaterecovery...",Our Children's Trust,youthvgov,9388,12076,623,False,2011-03-18,6.6,"Eugene, OR",,,
4225407,2017-10-23,1,After #TyphoonLan passed through Tokyo: Dry st...,,0,"[TyphoonLan, ClimateChange, ClimateFinance]",RRC.AP,RRCAP_AIT,310,124,175,False,2017-08-02,0.2,"Pathum Thani, Thailand",,,
4225408,2017-10-23,0,The EPA has canceled speaking appearances of t...,,0,[],Jake Cornwall,JakeM_1998,512647,1222,1794,False,2015-02-11,2.7,"United Kingdom, London",,,
4225409,2017-10-23,2,Check out @levinsources minerals behind #Green...,,1,"[GreenEconomy, climatechange]",Canadian Intl Resources & Development Institute,CIRDI_ICIRD,2237,1274,929,False,2014-01-27,3.7,"Vancouver, British Columbia, Canada",,,
4225410,2017-10-23,0,The EPA has canceled speaking appearances of t...,,0,[],Electro Edward!,electro_edward,109329,44,210,False,2017-03-04,0.6,"Oklahoma, USA",,,
4225411,2017-10-23,5,"Working with satellite data, scientists detect...",,5,[],Ian James,ByIanJames,19192,7423,1364,True,2009-04-03,8.6,"Phoenix, Arizona",,,
4225412,2017-10-23,0,EPA pulls scientists out of climate change con...,,0,[],Mohammad Keshtkar,k3shtk4r,944229,346,86,False,2010-09-24,7.1,,,,
4225413,2017-10-23,0,"Even the earth has rights in #Islam , so treat...",,0,"[Islam, Mercy]",This Is Islam,TII99,161597,1653,1580,False,2014-04-01,3.6,حساب للدعوة باللغة الانجليزية,,,
4225414,2017-10-23,0,Remember My Lai Massacre！ #ClimateChange #Anon...,,1,"[ClimateChange, Anonymous, WikiLeaks]",サイダーラジオは今日も言いたい放題,applecider52,128019,4914,3163,False,2009-05-12,8.4,kyoto-Shiga,,,
4225415,2017-10-23,9,Once spoke to a woman you said she's hoping th...,,10,[],Nita Cosby,5_2blue,93344,26070,22632,False,2010-02-02,7.7,"Houston, TX",,,


# Save Dataframe as pickle file

In [41]:
with open('tweets_full2_df.pkl', 'wb') as to_write:
    pickle.dump(tweets_full_df, to_write)

# Next Notebook: "Project_04_Classifier_R1"
- In this notebook, the tweets in `tweets_full_df` will be analyzed and classified as either "believer" or "denier" tweets

___

**Quick Tweet Cleanup (NOT USED)**
- *integrated into CountVectorizer in next notebook*

- *Remove the following characteristics from the tweets using custom function `tweet_cleanup`:*
    - *line breaks*
    - *URLs*
    - *emojis*
    - *numbers*
    - *capital letters*
    - *punctuation*

In [105]:
# def tweet_cleanup(tweet_dataframe):

#     '''
#     Takes in tweet dataframe and cleans up the tweet column by removing line breaks,
#     URLs, emojis, numbers, capital letters, and punctuation. Returns complete dataframe
#     with cleaned tweet column.
#     '''

#     full_text_list = []

#     for index, rows in tweet_dataframe['full_text'].items():
#         rows = rows.replace('\n', ' ') # remove line breaks
#         rows = re.sub(r"\bhttps://t\.co/\w+", '', rows) # remove URLs
#         rows = give_emoji_free_text(rows) # remove emojis
#         full_text_list.append(rows)

#     tweet_dataframe['full_text'] = full_text_list

#     # Remove numbers, capital letters, and punctuation
#     alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)
#     punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

#     tweet_dataframe['full_text'] = tweet_dataframe.full_text.map(alphanumeric).map(punc_lower)
    
#     return tweet_dataframe

In [106]:
# tweets_clean_df = tweet_cleanup(tweets_full_df)

___

In [4]:
# with open('tweets_subset2.pkl','rb') as read_file:
#     tweets_subset2_df = pickle.load(read_file)

In [5]:
# us_abbrevs = ['US', 'USA', 'United States',
# 'Alabama',
# 'AL',
# 'Alaska',
# 'AK',
# 'Arizona',
# 'AZ',
# 'Arkansas',
# 'AR',
# 'California',
# 'CA',
# 'Colorado',
# 'CO',
# 'Connecticut',
# 'CT',
# 'Delaware',
# 'DE',
# 'Florida',
# 'FL',
# 'Georgia',
# 'GA',
# 'Hawaii',
# 'HI',
# 'Idaho',
# 'ID',
# 'Illinois',
# 'IL',
# 'Indiana',
# 'IN',
# 'Iowa',
# 'IA',
# 'Kansas',
# 'KS',
# 'Kentucky',
# 'KY',
# 'Louisiana',
# 'LA',
# 'Maine',
# 'ME',
# 'Maryland',
# 'MD',
# 'Massachusetts',
# 'MA',
# 'Michigan',
# 'MI',
# 'Minnesota',
# 'MN',
# 'Mississippi',
# 'MS',
# 'Missouri',
# 'MO',
# 'Montana',
# 'MT',
# 'Nebraska',
# 'NE',
# 'Nevada',
# 'NV',
# 'New Hampshire',
# 'NH',
# 'New Jersey',
# 'NJ',
# 'New Mexico',
# 'NM',
# 'New York',
# 'NY',
# 'North Carolina',
# 'NC',
# 'North Dakota',
# 'ND',
# 'Ohio',
# 'OH',
# 'Oklahoma',
# 'OK',
# 'Oregon',
# 'OR',
# 'Pennsylvania',
# 'PA',
# 'Rhode Island',
# 'RI',
# 'South Carolina',
# 'SC',
# 'South Dakota',
# 'SD',
# 'Tennessee',
# 'TN',
# 'Texas',
# 'TX',
# 'Utah',
# 'UT',
# 'Vermont',
# 'VT',
# 'Virginia',
# 'VA',
# 'Washington',
# 'WA',
# 'West Virginia',
# 'WV',
# 'Wisconsin',
# 'WI',
# 'Wyoming',
# 'WY',
# 'District of Columbia',
# 'DC']

In [32]:
# tweets_subset2_df['User_Location'][90:108]

90                     Boulder, CO
91                        New York
92                       Worldwide
93                            here
94          Montréal Québec Canada
95             Berlin, Deutschland
96                           World
97                        Canberra
98                 London, England
99                Rome, Italy (EU)
100               Iqaluit, Nunavut
101                        Finland
102            Canterbury, England
103                      Hong Kong
104                       Canberra
105            HamOnt via Brampton
106    Melbourne, Vic. Bowen, Q'ld
107                  Cambridge, MA
Name: User_Location, dtype: object

In [24]:
# location_dict = {}

# for index, rows in tweets_subset2_df['User_Location'].items():
#     for loc in us_abbrevs:
#         if str(loc) in str(rows):
#             location_dict[index] = rows
#             break  # one match is enough; stop checking the remaining terms
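The substring test `str(loc) in str(rows)` also fires when an abbreviation sits inside a longer word (for example `IN` inside `INDIA`). A sketch of a stricter check using word boundaries follows; `us_terms`, `pattern`, and `looks_american` are hypothetical names, the term list is an abbreviated subset of `us_abbrevs`, and genuinely ambiguous names such as `Georgia` would still match.

```python
import re

# Abbreviated, illustrative subset of the us_abbrevs list.
us_terms = ['USA', 'United States', 'Colorado', 'CO',
            'Massachusetts', 'MA', 'Indiana', 'IN']

# \b anchors each term at word boundaries; longer terms are tried first.
pattern = re.compile(
    r'\b(?:'
    + '|'.join(re.escape(t) for t in sorted(us_terms, key=len, reverse=True))
    + r')\b'
)

def looks_american(location):
    """True when a whole-word US term appears in the location string."""
    return bool(pattern.search(str(location)))

for loc in ['Boulder, CO', 'Cambridge, MA', 'MANILA', 'INDIA']:
    substring_hit = any(t in loc for t in us_terms)
    print(loc, '| word-boundary:', looks_american(loc), '| substring:', substring_hit)
```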

In [34]:
# hashtag_collapse = []

# for index, rows in tweets_subset2_df['Hashtags_List'].items():
#     for i in rows:
#         hashtag_collapse.append(i)
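Flattening and counting the hashtags can also be done with the standard library. The sketch below additionally lowercases each tag, which is an assumption on my part rather than something the notebook does; it would merge variants such as `Energy` and `energy` that are counted separately otherwise. `hashtags_per_tweet` is a hypothetical stand-in for the `Hashtags_List` column.

```python
from collections import Counter
from itertools import chain

# Illustrative stand-in for the Hashtags_List column.
hashtags_per_tweet = [
    ['climatechange', 'Energy'],
    [],
    ['ClimateChange', 'energy', 'oceans'],
]

# Flatten the nested lists and count case-insensitively.
counts = Counter(tag.lower() for tag in chain.from_iterable(hashtags_per_tweet))

print(counts['climatechange'], counts['energy'], counts['oceans'])  # 2 2 1
```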

In [39]:
# pd.Series(hashtag_collapse).value_counts()[100:150]

LNP               184
environmental     182
UN                182
innovation        181
KAG               179
oceans            178
Resist            178
abpoli            177
GOP               176
Energy            176
Weather           175
floods            175
VoteThemOut       174
Green             173
Democracy         173
leadership        172
ocean             172
WaterIsLife       171
CCOT              170
technology        169
ableg             168
Ontario           166
MAGA2018          166
WakeUpAmerica     166
VoteDemocrat      166
wildlife          166
conservation      163
KAG2020           163
MAGA2020          163
RenewablesNow     163
trump             163
activism          162
SmartNews         162
Florida           162
KAG2018           161
feedly            161
change            160
Christian         160
flooding          159
deforestation     159
Armageddon        158
vegan             157
Solar             156
renewable         155
Geoengineering    153
Africa    