# Data Preparation


## Data Details

Using both externally collected and personally collected Tweets. 

### External Data 
* [Tweets about distance learning] (https://www.kaggle.com/barishasdemir/tweets-about-distance-learning)
* Collected by the refrenced Kaggle user with the Twitter API via tweepy. 
* Collection Dates: July 23 2020 through August 14 2020
* Location: Worldwide
* Hashtags: 
    * #distancelearning, #onlineschool, #onlineteaching, #virtuallearning, #onlineducation, #distanceeducation, #OnlineClasses, #DigitalLearning, #elearning, #onlinelearning
* Keywords: 
    * “distance learning”, “online teaching”, “online education”, “online course”, “online semester”, “distance course”, “distance education”, “online class”,” e-learning”, “e learning"

### Collected Data
* Collected by the author using the Twitter API via __tweepy__. Tweets collected via Queries and Streaming. Did not collect Retweets.
* Stored in __MongoDb__ then exported to csv files for further processing.
* I performed sentiment labeling for a portion of the Tweets I collected. Took a hybrid approach of some human labeling and sentiment tool labeling using __VADER__ and __TextBlob__ tools.
* Collection Dates: January 06 2021 through January 14 2021
* Location: United States
* Keywords/Query terms:
    * 'k-12 remote', 'k-12 distance', 'k-12 (on-line OR online)', 'k-12 virtual', 'k-12 hybrid', 'teach remote learn', 'teach distance learn', 'teach (on-line OR online) learn', 'teach virtual learn', 'teach hybrid learn', '(kid OR child) remote learn', '(kid OR child) distance learn', '(kid OR child) (on-line OR online) learn', '(kid OR child) virtual learn', '(kid OR child) hybrid learn'


### Data Prep steps: 
* Repeat for each data set:
    * Read in csv
    * (External only) Standardize column names as needed    
    * (External only) Select only tweets with likely US locations
    * Select Tweets with character count 100 and above
    * Apply the VADER and Text Blob sentiment tools to Tweets that are not human labeled. 
    * Select Tweets where both tools determine the __same__ sentiment (alternative to Human labeling)
    * Save individual dataset to file
* Combine all data sets to ONE dataframe 
* Check for null and duplicate content
* Save to ONE csv file



In [5]:
# Import the required libraries
import pandas as pd
import numpy as np
import re

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
 

In [6]:
# Functions 
usa_states_fullname_regex = '(ALABAMA|ALASKA|ARIZONA|ARKANSAS|CALIFORNIA|'\
                            'COLORADO|CONNECTICUT|DELAWARE|FLORIDA|GEORGIA|HAWAII|'\
                            'IDAHO|ILLINOIS|INDIANA|IOWA|KANSAS|KENTUCKY|'\
                            'LOUISIANA|MAINE|MARYLAND|MASSACHUSETTS|MICHIGAN|'\
                            'MINNESOTA|MISSISSIPPI|MISSOURI|MONTANA|'\
                            'NEBRASKA|NEVADA|NEW\sHAMPSHIRE|NEWSJERSEY|'\
                            'NEW\sMEXICO|NEW\sYORK|NORTH\sCAROLINA|'\
                            'NORTH\sDAKOTA|OHIO|OKLAHOMA|OREGON|PENNSYLVANIA|'\
                            'RHODE\sISLAND|SOUTH\sCAROLINA|SOUTH\sDAKOTA|'\
                            'TENNESSEE|TEXAS|UTAH|VERMONT|VIRGINIA|'\
                            'WASHINGTON|WEST\sVIRGINIA|WISCONSIN|WYOMING|USA)'


usa_states_regex = ',\s{1}(A[KLRZ]|C[AOT]|D[CE]|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])'

def get_exact_dups(df):
    '''
    Returns duplicates
    '''
    dups = df[df.duplicated()]
    return dups

def get_tweet_dups(df, col_names):
    '''
    Returns duplicates based on given column name
    '''
    dups = df[df.duplicated(subset=col_names)]
    return dups

def get_is_us_loc(loc_string):
    '''
    Uses regular expression(s) to detect if provided Location string 
    is a probable United States location.
    '''
    matches_abbrev = bool(re.search(usa_states_regex, loc_string.upper()))
    if not matches_abbrev:
        matches_full_name = bool(re.search(usa_states_fullname_regex, loc_string.upper())) 
    return (matches_abbrev or matches_full_name)

def get_vader_sentiment(vader_analyzer, tweet):
    '''
    Get sentiment of given Tweet text using VADER sentiment tool.
    '''
    tweet = tweet.replace('#','')  # we want things like #fail to be included in text
    vader_scores = vader_analyzer.polarity_scores(tweet)
    compound_score = vader_scores['compound']
    vader_sentiment = None
    # using thresholds from VADER developers/researchers
    if (compound_score >= 0.05):
        vader_sentiment = 'positive'
    elif (compound_score < 0.05 and compound_score > -0.05):
        vader_sentiment = 'neutral'
    elif (compound_score <= -0.05):
        vader_sentiment = 'negative'
    return vader_sentiment

def get_text_blob_sentiment(tweet):
    '''
    Get sentiment of given Tweet text using TextBlob sentiment tool
    '''
    polarity = TextBlob(tweet).sentiment.polarity
    # The polarity score is a float within the range [-1.0, 1.0]. 
    textblob_sentiment = None
    if (polarity > 0):
        textblob_sentiment = 'positive'
    elif (polarity == 0):
        textblob_sentiment = 'neutral'
    elif (polarity < 0):
        textblob_sentiment = 'negative'
    return textblob_sentiment  
    

def get_tools_match(vader_sentiment, tb_sentiment):
    '''
    Return True if the provided sentiment labels match, False if not.
    '''
    return vader_sentiment == tb_sentiment


In [40]:
# Read in the external data: 
external_data = pd.read_csv('../external_data/tweets_raw.csv')
print(external_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202645 entries, 0 to 202644
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Unnamed: 0     202645 non-null  int64 
 1   Unnamed: 0.1   202645 non-null  int64 
 2   Content        202645 non-null  object
 3   Location       155123 non-null  object
 4   Username       202645 non-null  object
 5   Retweet-Count  202645 non-null  int64 
 6   Favorites      202645 non-null  int64 
 7   Created at     202645 non-null  object
dtypes: int64(4), object(4)
memory usage: 12.4+ MB
None


In [41]:
#rename columns
external_data.rename(columns = { 'Content':'content',
                                 'Location':'user_loc', 
                                 'Username':'user_screen_name', 
                                 'Retweet-Count':'retweet_count', 
                                 'Favorites':'fav_count', 
                                 'Created at': 'created_at'}, inplace = True) 
external_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202645 entries, 0 to 202644
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Unnamed: 0        202645 non-null  int64 
 1   Unnamed: 0.1      202645 non-null  int64 
 2   content           202645 non-null  object
 3   user_loc          155123 non-null  object
 4   user_screen_name  202645 non-null  object
 5   retweet_count     202645 non-null  int64 
 6   fav_count         202645 non-null  int64 
 7   created_at        202645 non-null  object
dtypes: int64(4), object(4)
memory usage: 12.4+ MB


In [42]:
# Drop the rows with null user_location and Duplicated content
external_data.dropna(subset=['user_loc'], inplace=True)
external_data.drop_duplicates(subset=['content'], inplace=True)
print(external_data.shape)

(139410, 8)


In [43]:
# Check location for a US state (using regex). We only want to use Tweets with a US location
external_data['is_us_loc'] = external_data.apply(lambda row: get_is_us_loc(row['user_loc']), axis=1)
us_only_data = external_data[external_data['is_us_loc'] == True]
us_only_data.drop('is_us_loc', axis=1, inplace=True)
us_only_data.shape

(64344, 8)

In [44]:
# Get the char count
us_only_data['char_count'] = us_only_data.apply(lambda row: len(row['content']), axis=1)
us_only_data.head(3)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,content,user_loc,user_screen_name,retweet_count,fav_count,created_at,char_count
9,9,9,“Instructional Considerations for the 2020-21 ...,"Illinois, USA",Erik_Youngman,0,2,2020-08-02 00:10:26,276
10,10,10,With all the uncertainty of what September wil...,"Lyndhurst, NJ",Renee_LoBue,0,0,2020-08-01 23:57:31,264
11,11,11,Check this out on Wakelet - Digital learning a...,"Cary, NC",SupriyaVasu,0,0,2020-08-01 23:20:38,133


In [45]:
# Keep only the tweets with 100 and above characters
us_only_data = us_only_data[us_only_data['char_count'] >= 100]
us_only_data.shape

(58941, 9)

In [46]:
# Get the VADER and TextBlob sentiments
analyzer = SentimentIntensityAnalyzer()
us_only_data['vader_sentiment'] = us_only_data.apply(lambda row: get_vader_sentiment(analyzer, row['content']), axis=1)
us_only_data['text_blob_sentiment'] = us_only_data.apply(lambda row: get_text_blob_sentiment(row['content']), axis=1)

# Mark where the two tools agree
us_only_data['tools_match'] = us_only_data.apply(lambda row: get_tools_match(row['vader_sentiment'], row['text_blob_sentiment']), axis=1)

# keep only the tweets where the tools agree 
us_only_data = us_only_data[us_only_data['tools_match']]
us_only_data.head(3)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,content,user_loc,user_screen_name,retweet_count,fav_count,created_at,char_count,vader_sentiment,text_blob_sentiment,tools_match
9,9,9,“Instructional Considerations for the 2020-21 ...,"Illinois, USA",Erik_Youngman,0,2,2020-08-02 00:10:26,276,neutral,neutral,True
10,10,10,With all the uncertainty of what September wil...,"Lyndhurst, NJ",Renee_LoBue,0,0,2020-08-01 23:57:31,264,positive,positive,True
11,11,11,Check this out on Wakelet - Digital learning a...,"Cary, NC",SupriyaVasu,0,0,2020-08-01 23:20:38,133,neutral,neutral,True


In [47]:
# Now create cols for sentiment and sentiment method
us_only_data['sentiment_method'] = 'tools'
us_only_data['sentiment'] = us_only_data.apply(lambda row: row['vader_sentiment'], axis=1)

# delete the now un-needed columns
us_only_data.drop('vader_sentiment', axis=1, inplace=True)
us_only_data.drop('text_blob_sentiment', axis=1, inplace=True)
us_only_data.drop('tools_match', axis=1, inplace=True)
us_only_data.head(3)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,content,user_loc,user_screen_name,retweet_count,fav_count,created_at,char_count,sentiment_method,sentiment
9,9,9,“Instructional Considerations for the 2020-21 ...,"Illinois, USA",Erik_Youngman,0,2,2020-08-02 00:10:26,276,tools,neutral
10,10,10,With all the uncertainty of what September wil...,"Lyndhurst, NJ",Renee_LoBue,0,0,2020-08-01 23:57:31,264,tools,positive
11,11,11,Check this out on Wakelet - Digital learning a...,"Cary, NC",SupriyaVasu,0,0,2020-08-01 23:20:38,133,tools,neutral


In [48]:
us_only_data.shape

(38044, 11)

In [49]:
# SAVE to file
us_only_data.to_csv('../data/us_only_external_data_tweets_TOOL_labeled.csv')

## Now prepare the data I collected in January.  
### Only keep tweets of 100 characters or more (same criteria applied to external dataset of tweets)
### I labeled 356 tweets for sentiment. For time contraints, use the tools to labels the rest - use sentiment from tools where both agree.

### Tweets from Query in January - HUMAN sentiment label

In [50]:
human_labeled_q_tweets =  pd.read_csv('../data/jan_queried_tweets_HUMAN_labeled.csv')
human_labeled_q_tweets.head(3)

Unnamed: 0,_id,id_str,content,user_loc,user_screen_name,retweet_count,fav_count,created_at,sentiment
0,5ffde71b5e4953000d99dc7c,1.348996e+18,These really are critical.\n2yrs ago I took 9 ...,"Texas, USA",summers_llm,0,0,2021-01-12 14:11:21,neutral
1,5ffde71b5e4953000d99dc7d,1.348986e+18,ConnectEd After School Lesson Grades K-2\nThur...,"North Dakota, USA",ncecnd,0,0,2021-01-12 13:30:04,neutral
2,5ffde71b5e4953000d99dc7e,1.348978e+18,Don't forget to register for our FREE Remote a...,New York City,ReadWorks,0,0,2021-01-12 13:01:36,neutral


In [51]:
# Keep only the tweets 100 chars and over
human_labeled_q_tweets['char_count'] = human_labeled_q_tweets.apply(lambda row: len(row['content']), axis=1)
human_labeled_q_tweets = human_labeled_q_tweets[human_labeled_q_tweets['char_count'] >= 100]
human_labeled_q_tweets.shape

(348, 10)

In [52]:
# Flag these tweets as having their sentiment set by a human
human_labeled_q_tweets['sentiment_method'] = 'human'
human_labeled_q_tweets.head(3)

Unnamed: 0,_id,id_str,content,user_loc,user_screen_name,retweet_count,fav_count,created_at,sentiment,char_count,sentiment_method
0,5ffde71b5e4953000d99dc7c,1.348996e+18,These really are critical.\n2yrs ago I took 9 ...,"Texas, USA",summers_llm,0,0,2021-01-12 14:11:21,neutral,213,human
1,5ffde71b5e4953000d99dc7d,1.348986e+18,ConnectEd After School Lesson Grades K-2\nThur...,"North Dakota, USA",ncecnd,0,0,2021-01-12 13:30:04,neutral,122,human
2,5ffde71b5e4953000d99dc7e,1.348978e+18,Don't forget to register for our FREE Remote a...,New York City,ReadWorks,0,0,2021-01-12 13:01:36,neutral,303,human


In [53]:
human_labeled_q_tweets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 348 entries, 0 to 355
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   _id               348 non-null    object 
 1   id_str            348 non-null    float64
 2   content           348 non-null    object 
 3   user_loc          348 non-null    object 
 4   user_screen_name  348 non-null    object 
 5   retweet_count     348 non-null    int64  
 6   fav_count         348 non-null    int64  
 7   created_at        348 non-null    object 
 8   sentiment         348 non-null    object 
 9   char_count        348 non-null    int64  
 10  sentiment_method  348 non-null    object 
dtypes: float64(1), int64(3), object(7)
memory usage: 32.6+ KB


In [54]:
# save to file 
human_labeled_q_tweets.to_csv('../data/jan_2021_queried_tweets_HUMAN_labeled.csv')

### Tweets from query in January - not labeled yet


In [55]:
q_tweets =  pd.read_csv('../data/jan_queried_tweets_NO_label.csv')
q_tweets.head(3)

Unnamed: 0,_id,id_str,content,user_loc,user_screen_name,retweet_count,fav_count,created_at
0,5ffde71f5e4953000d99dd55,1.348735e+18,Virtual learning has caused a fair share of fr...,Wisconsin,Edficiency,0,0,2021-01-11 20:55:39
1,5ffde71f5e4953000d99dd58,1.348704e+18,Preparing for high school is an exciting time!...,"St. Augustine, Florida",SJCSD,0,0,2021-01-11 18:51:41
2,5ffde71f5e4953000d99dd59,1.348685e+18,Updated list of Wi-Fi Hotspots in Accomack Cou...,"Oak Hall, Virginia",AMSPANTHERS,2,2,2021-01-11 17:33:40


In [56]:
q_tweets.shape

(222, 8)

In [57]:
# Keep only the tweets 100 chars and over
q_tweets['char_count'] = q_tweets.apply(lambda row: len(row['content']), axis=1)
q_tweets = q_tweets[q_tweets['char_count'] >= 100]
print(q_tweets.shape)
q_tweets.head(3)

(212, 9)


Unnamed: 0,_id,id_str,content,user_loc,user_screen_name,retweet_count,fav_count,created_at,char_count
0,5ffde71f5e4953000d99dd55,1.348735e+18,Virtual learning has caused a fair share of fr...,Wisconsin,Edficiency,0,0,2021-01-11 20:55:39,272
1,5ffde71f5e4953000d99dd58,1.348704e+18,Preparing for high school is an exciting time!...,"St. Augustine, Florida",SJCSD,0,0,2021-01-11 18:51:41,221
2,5ffde71f5e4953000d99dd59,1.348685e+18,Updated list of Wi-Fi Hotspots in Accomack Cou...,"Oak Hall, Virginia",AMSPANTHERS,2,2,2021-01-11 17:33:40,146


In [58]:
# Get the VADER and TextBlob sentiments
analyzer = SentimentIntensityAnalyzer()
q_tweets['vader_sentiment'] = q_tweets.apply(lambda row: get_vader_sentiment(analyzer, row['content']), axis=1)
q_tweets['text_blob_sentiment'] = q_tweets.apply(lambda row: get_text_blob_sentiment(row['content']), axis=1)

# Mark where the two tools agree
q_tweets['tools_match'] = q_tweets.apply(lambda row: get_tools_match(row['vader_sentiment'], row['text_blob_sentiment']), axis=1)

# keep only the tweets where the tools agree 
q_tweets = q_tweets[q_tweets['tools_match']]
print(q_tweets.shape)
q_tweets.head(3)


(130, 12)


Unnamed: 0,_id,id_str,content,user_loc,user_screen_name,retweet_count,fav_count,created_at,char_count,vader_sentiment,text_blob_sentiment,tools_match
0,5ffde71f5e4953000d99dd55,1.348735e+18,Virtual learning has caused a fair share of fr...,Wisconsin,Edficiency,0,0,2021-01-11 20:55:39,272,positive,positive,True
1,5ffde71f5e4953000d99dd58,1.348704e+18,Preparing for high school is an exciting time!...,"St. Augustine, Florida",SJCSD,0,0,2021-01-11 18:51:41,221,positive,positive,True
3,5ffde71f5e4953000d99dd5d,1.348672e+18,The Nature-based 4K Parent Night for enrollmen...,"Newburg, WI",RiveredgeNC,0,0,2021-01-11 16:42:40,276,positive,positive,True


In [59]:
# Now create cols for sentiment and sentiment method
q_tweets['sentiment_method'] = 'tools'
q_tweets['sentiment'] = q_tweets.apply(lambda row: row['vader_sentiment'], axis=1)

# delete the now un-needed columns
q_tweets.drop('vader_sentiment', axis=1, inplace=True)
q_tweets.drop('text_blob_sentiment', axis=1, inplace=True)
q_tweets.drop('tools_match', axis=1, inplace=True)
print(q_tweets.shape)
q_tweets.head(3)

(130, 11)


Unnamed: 0,_id,id_str,content,user_loc,user_screen_name,retweet_count,fav_count,created_at,char_count,sentiment_method,sentiment
0,5ffde71f5e4953000d99dd55,1.348735e+18,Virtual learning has caused a fair share of fr...,Wisconsin,Edficiency,0,0,2021-01-11 20:55:39,272,tools,positive
1,5ffde71f5e4953000d99dd58,1.348704e+18,Preparing for high school is an exciting time!...,"St. Augustine, Florida",SJCSD,0,0,2021-01-11 18:51:41,221,tools,positive
3,5ffde71f5e4953000d99dd5d,1.348672e+18,The Nature-based 4K Parent Night for enrollmen...,"Newburg, WI",RiveredgeNC,0,0,2021-01-11 16:42:40,276,tools,positive


In [60]:
# save to file 
q_tweets.to_csv('../data/jan_2021_queried_tweets_TOOL_labeled.csv')

### Now prepare the Tweets that I collected via Streaming over several days in January


In [61]:
s_tweets =  pd.read_csv('../data/jan_streaming_tweets_NO_label.csv')
s_tweets.head(3)

Unnamed: 0,_id,id_str,content,user_loc,user_screen_name,retweet_count,fav_count,created_at
0,5ffde42364e25f9f26125929,1.349054e+18,Open Forum!\n\nVirtual Parent Hangout - Januar...,"Indianapolis, IN",ISDHoosiers,0,0,2021-01-12 18:02:06
1,5ffde4c064e25f9f2612592a,1.349055e+18,@besf0rt Never forget in 2000 profiling a Japa...,Florida hellmouth,ImperialeNancy,0,0,2021-01-12 18:04:43
2,5ffde54d64e25f9f2612592b,1.349055e+18,"“This year, the “mothers and others” are turni...","Suburban DC, Maryland",gunsensemelissa,0,0,2021-01-12 18:07:04


In [62]:
s_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 386 entries, 0 to 385
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   _id               386 non-null    object 
 1   id_str            386 non-null    float64
 2   content           386 non-null    object 
 3   user_loc          386 non-null    object 
 4   user_screen_name  386 non-null    object 
 5   retweet_count     386 non-null    int64  
 6   fav_count         386 non-null    int64  
 7   created_at        386 non-null    object 
dtypes: float64(1), int64(2), object(5)
memory usage: 24.2+ KB


In [63]:
# Keep only the tweets 100 chars and over
s_tweets['char_count'] = s_tweets.apply(lambda row: len(row['content']), axis=1)
s_tweets = s_tweets[s_tweets['char_count'] >= 100]
print(s_tweets.shape)
s_tweets.head(3)

(347, 9)


Unnamed: 0,_id,id_str,content,user_loc,user_screen_name,retweet_count,fav_count,created_at,char_count
0,5ffde42364e25f9f26125929,1.349054e+18,Open Forum!\n\nVirtual Parent Hangout - Januar...,"Indianapolis, IN",ISDHoosiers,0,0,2021-01-12 18:02:06,265
1,5ffde4c064e25f9f2612592a,1.349055e+18,@besf0rt Never forget in 2000 profiling a Japa...,Florida hellmouth,ImperialeNancy,0,0,2021-01-12 18:04:43,289
2,5ffde54d64e25f9f2612592b,1.349055e+18,"“This year, the “mothers and others” are turni...","Suburban DC, Maryland",gunsensemelissa,0,0,2021-01-12 18:07:04,283


In [65]:
# Get the VADER and TextBlob sentiments
analyzer = SentimentIntensityAnalyzer()
s_tweets['vader_sentiment'] = s_tweets.apply(lambda row: get_vader_sentiment(analyzer, row['content']), axis=1)
s_tweets['text_blob_sentiment'] = s_tweets.apply(lambda row: get_text_blob_sentiment(row['content']), axis=1)

# Mark where the two tools agree
s_tweets['tools_match'] = s_tweets.apply(lambda row: get_tools_match(row['vader_sentiment'], row['text_blob_sentiment']), axis=1)

# keep only the tweets where the tools agree 
s_tweets = s_tweets[s_tweets['tools_match']]
print(s_tweets.shape)
s_tweets.head(3)

(193, 13)


Unnamed: 0,_id,id_str,content,user_loc,user_screen_name,retweet_count,fav_count,created_at,char_count,sentiment_method,vader_sentiment,text_blob_sentiment,tools_match
3,5ffde60d64e25f9f2612592c,1.349056e+18,"If a student is missing class a bunch, their g...","Provo, UT",avatargrace,0,0,2021-01-12 18:10:16,275,tools,negative,negative,True
8,5ffdf9c180d443513a191637,1.349077e+18,"@JamesTodaroMD @ConservMama17 I went to proms,...",Mountains of California,AudreyJeanne,0,0,2021-01-12 19:34:19,308,tools,positive,positive,True
9,5ffdfcbbed845d2d2c79e966,1.34908e+18,Last weeks blog by @RachelJTeaches provides an...,"Arlington, VA",intellispark,0,0,2021-01-12 19:47:01,200,tools,positive,positive,True


In [66]:
# Now create cols for sentiment and sentiment method
s_tweets['sentiment_method'] = 'tools'
s_tweets['sentiment'] = s_tweets.apply(lambda row: row['vader_sentiment'], axis=1)
s_tweets.head(3)

# delete the now un-needed columns
s_tweets.drop('vader_sentiment', axis=1, inplace=True)
s_tweets.drop('text_blob_sentiment', axis=1, inplace=True)
s_tweets.drop('tools_match', axis=1, inplace=True)
print(s_tweets.shape)

(193, 11)


In [67]:
# save to file 
s_tweets.to_csv('../data/jan_2021_streaming_tweets_TOOL_labeled.csv')

### Now create a single file with both external tweets and the tweets collected in January

In [68]:

all_tweets = us_only_data.append(human_labeled_q_tweets, ignore_index=True)
all_tweets.append(q_tweets, ignore_index=True)
all_tweets.append(s_tweets, ignore_index=True)

all_tweets.info()
all_tweets.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38392 entries, 0 to 38391
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        38044 non-null  float64
 1   Unnamed: 0.1      38044 non-null  float64
 2   content           38392 non-null  object 
 3   user_loc          38392 non-null  object 
 4   user_screen_name  38392 non-null  object 
 5   retweet_count     38392 non-null  int64  
 6   fav_count         38392 non-null  int64  
 7   created_at        38392 non-null  object 
 8   char_count        38392 non-null  int64  
 9   sentiment_method  38392 non-null  object 
 10  sentiment         38392 non-null  object 
 11  _id               348 non-null    object 
 12  id_str            348 non-null    float64
dtypes: float64(3), int64(3), object(7)
memory usage: 3.8+ MB


(38392, 13)

In [69]:
all_tweets.head()


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,content,user_loc,user_screen_name,retweet_count,fav_count,created_at,char_count,sentiment_method,sentiment,_id,id_str
0,9.0,9.0,“Instructional Considerations for the 2020-21 ...,"Illinois, USA",Erik_Youngman,0,2,2020-08-02 00:10:26,276,tools,neutral,,
1,10.0,10.0,With all the uncertainty of what September wil...,"Lyndhurst, NJ",Renee_LoBue,0,0,2020-08-01 23:57:31,264,tools,positive,,
2,11.0,11.0,Check this out on Wakelet - Digital learning a...,"Cary, NC",SupriyaVasu,0,0,2020-08-01 23:20:38,133,tools,neutral,,
3,12.0,12.0,Happy Friendship Day!\n#rdnums #nagaland #kohi...,"Kohima, India",rdnums,2,1,2020-08-01 23:17:09,264,tools,positive,,
4,13.0,13.0,Beat the summer heat with over 400 cool games ...,"Providence, RI",ABCyaGames,0,2,2020-08-01 23:00:00,146,tools,positive,,


In [70]:
# drop the columns we don't need
all_tweets.drop('Unnamed: 0', axis=1, inplace=True)
all_tweets.drop('Unnamed: 0.1', axis=1, inplace=True)
all_tweets.drop('_id', axis=1, inplace=True)
all_tweets.drop('id_str', axis=1, inplace=True)

all_tweets.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38392 entries, 0 to 38391
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   content           38392 non-null  object
 1   user_loc          38392 non-null  object
 2   user_screen_name  38392 non-null  object
 3   retweet_count     38392 non-null  int64 
 4   fav_count         38392 non-null  int64 
 5   created_at        38392 non-null  object
 6   char_count        38392 non-null  int64 
 7   sentiment_method  38392 non-null  object
 8   sentiment         38392 non-null  object
dtypes: int64(3), object(6)
memory usage: 2.6+ MB


In [73]:
# Check for any nulls
all_tweets.isnull().sum()

content             0
user_loc            0
user_screen_name    0
retweet_count       0
fav_count           0
created_at          0
char_count          0
sentiment_method    0
sentiment           0
dtype: int64

In [80]:
# no nulls...now check for duplicate contents
len(all_tweets['content'].unique())

38392

In [81]:
# save all tweets to file 
all_tweets.to_csv('../data/all_tweets_combined.csv')