# Step 1: Collecting Datasets

First, I need to collect the datasets this project depends on.

I am going to use two tweet datasets, both sourced from Kaggle:



*   WomensMarch Tag Tweets (https://www.kaggle.com/datasets/adhok93/inauguration-and-womensmarch-tweets)
*   2020–2021 Indian farmers' protest (https://www.kaggle.com/datasets/prathamsharma123/farmers-protest-tweets-dataset-csv)



In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
! pip install kaggle

In [None]:
# -p keeps this cell from failing if the directory already exists
! mkdir -p ~/.kaggle

In [None]:
! cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
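
The Kaggle CLI reads its API token from `~/.kaggle/kaggle.json` and prints a warning when that file is readable by other users. As a minimal sketch, the permissions can be tightened from Python; `secure_kaggle_token` is a hypothetical helper name, not part of the Kaggle package.

```python
import os
import stat
from pathlib import Path


def secure_kaggle_token(path):
    """Restrict a token file to owner read/write (the equivalent of chmod 600)."""
    token = Path(path).expanduser()
    if not token.is_file():
        raise FileNotFoundError(f"No token found at {token}")
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so the caller can check them.
    return stat.S_IMODE(token.stat().st_mode)
```

In this notebook the call would be `secure_kaggle_token("~/.kaggle/kaggle.json")`; the shell equivalent is `! chmod 600 ~/.kaggle/kaggle.json`.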

In [None]:
! kaggle datasets download adhok93/inauguration-and-womensmarch-tweets

In [None]:
!unzip inauguration-and-womensmarch-tweets.zip

In [None]:
! kaggle datasets download prathamsharma123/farmers-protest-tweets-dataset-csv

In [None]:
!unzip farmers-protest-tweets-dataset-csv.zip
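
The file names read in the next cells (`womenmarch.csv` and `tweets.csv`) are whatever the archives happen to contain. As a sketch, the standard-library `zipfile` module can list the CSV members of an archive without extracting it; `list_csv_members` is a hypothetical helper name.

```python
import zipfile


def list_csv_members(zip_path):
    """Return the names of the CSV files stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [name for name in archive.namelist() if name.lower().endswith(".csv")]
```

For example, `list_csv_members("farmers-protest-tweets-dataset-csv.zip")` would show which file to pass to `pd.read_csv`.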

Now that we've downloaded and unzipped the two datasets in this Colab Notebook, let's take a look at what they look like:

In [10]:
import pandas as pd

womenmarch = pd.read_csv("womenmarch.csv",encoding='ISO-8859-1')
womenmarch.head()

Unnamed: 0.1,Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted,longitude,latitude
0,1,"So far, I think 2017 has been the year of the ...",False,0,,2017-02-08 05:14:24,False,,829196660546949120,,TweetDeck,KubeJ9,0,False,False,,
1,2,I'm attending a @theactionnet event: Stanislau...,False,0,,2017-02-08 05:14:18,False,,829196636396072960,,Twitter Web Client,AnnStrahm,0,False,False,,
2,3,RT @AmyMek: Meanwhile In Washington State... (...,False,0,,2017-02-08 05:14:17,False,,829196631920766977,,Twitter for Android,czarnylks,481,True,False,,
3,4,"Bet #womensmarch co-founder , l sarsour has lo...",False,0,,2017-02-08 05:14:07,True,,829196591441534976,,Twitter for iPhone,KennethECollins,0,False,False,,
4,5,RT @yashar: WATCH: In her first video statemen...,False,0,,2017-02-08 05:13:58,False,,829196553558622210,,Twitter for iPhone,Shanaynay_fuck,1836,True,False,,


In [11]:
farmers = pd.read_csv("tweets.csv",encoding='ISO-8859-1')
farmers.head()

Unnamed: 0,tweetUrl,date,renderedContent,tweetId,userId,replyCount,retweetCount,likeCount,quoteCount,source,media,retweetedTweet,quotedTweet,mentionedUsers
0,https://twitter.com/ShashiRajbhar6/status/1376...,2021-03-30 03:33:46+00:00,Support ð\n\n#FarmersProtest,1.376739e+18,1.01597e+18,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",,,,
1,https://twitter.com/kaursuk06272818/status/137...,2021-03-30 03:33:23+00:00,Supporting farmers means supporting our countr...,1.376739e+18,1.332937e+18,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
2,https://twitter.com/kaursuk06272818/status/137...,2021-03-30 03:31:00+00:00,Support farmers if you are related to food #St...,1.376739e+18,1.332937e+18,0,0,0,0,"<a href=""http://twitter.com/download/android"" ...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,
3,https://twitter.com/SukhdevSingh_/status/13767...,2021-03-30 03:30:45+00:00,#StopHateAgainstFarmers support #FarmersProtes...,1.376739e+18,1.308357e+18,0,1,3,0,"<a href=""http://twitter.com/download/android"" ...",,,,
4,https://twitter.com/Davidmu66668113/status/137...,2021-03-30 03:30:30+00:00,"You hate farmers I hate you, \nif you love the...",1.376739e+18,1.357312e+18,0,0,1,0,"<a href=""http://twitter.com/download/android"" ...",,,,


Next, let's convert both DataFrames to dictionaries so the columns can be cleaned one at a time.

In [12]:
womenmarch = womenmarch.to_dict()

In [13]:
print(womenmarch.keys())

dict_keys(['Unnamed: 0', 'text', 'favorited', 'favoriteCount', 'replyToSN', 'created', 'truncated', 'replyToSID', 'id', 'replyToUID', 'statusSource', 'screenName', 'retweetCount', 'isRetweet', 'retweeted', 'longitude', 'latitude'])


In [14]:
farmers = farmers.to_dict()

In [15]:
print(farmers.keys())

dict_keys(['tweetUrl', 'date', 'renderedContent', 'tweetId', 'userId', 'replyCount', 'retweetCount', 'likeCount', 'quoteCount', 'source', 'media', 'retweetedTweet', 'quotedTweet', 'mentionedUsers'])
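
With its default settings, `DataFrame.to_dict()` returns each column as a mapping from row index to value, which is why the cleaning cells further down call `.values()` on every column. The sketch below mimics that shape with a hand-written dictionary (the two sample rows are made up for illustration) and flattens it into plain lists.

```python
# Hand-written stand-in for the default output of DataFrame.to_dict():
# each column maps row index -> value. The sample values are illustrative.
nested = {
    "text": {0: "first tweet", 1: "second tweet"},
    "retweetCount": {0: 0, 1: 481},
}

# Flatten each column to a list, preserving row order.
flat = {col: list(rows.values()) for col, rows in nested.items()}

print(flat["text"])
print(flat["retweetCount"])
```

pandas can also produce the flattened form directly with `DataFrame.to_dict(orient="list")`.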


The tweets contain URLs and stray characters that could interfere with our analysis, so let's strip them out (see [here](https://www.geeksforgeeks.org/python-removing-unwanted-characters-from-string/) and [here](https://stackoverflow.com/questions/43358857/how-to-remove-special-characters-except-space-from-a-file-in-python) for more info):

In [16]:
import re

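for col in womenmarch:
  if col == 'text':
    # Remove URLs first, then drop everything except letters, digits, and whitespace.
    womenmarch[col] = [re.sub(r'[^a-zA-Z0-9\s]+', '', re.sub(r'http\S+', '', tweet)) for tweet in womenmarch[col].values()]
  else:
    # Flatten the {row index: value} mapping into a plain list.
    womenmarch[col] = list(womenmarch[col].values())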

In [17]:
for col in farmers:
  if col == 'renderedContent':
    # Same cleanup, applied to the column that holds the farmers' protest tweet text.
    farmers[col] = [re.sub(r'[^a-zA-Z0-9\s]+', '', re.sub(r'http\S+', '', tweet)) for tweet in farmers[col].values()]
  else:
    farmers[col] = list(farmers[col].values())
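
To see what the two nested `re.sub` calls do to a single tweet, here is a small sketch that wraps them in a function. The name `clean_tweet` and the sample string are hypothetical, and treating non-string values (such as a NaN from pandas) as empty strings is an assumption, not something the cells above do.

```python
import re

URL_PATTERN = re.compile(r"http\S+")
NON_ALNUM_PATTERN = re.compile(r"[^a-zA-Z0-9\s]+")


def clean_tweet(tweet):
    """Strip URLs, then drop every character that is not a letter, digit, or whitespace."""
    if not isinstance(tweet, str):
        # Assumption: missing values (e.g. NaN read by pandas) become empty strings.
        return ""
    without_urls = URL_PATTERN.sub("", tweet)
    return NON_ALNUM_PATTERN.sub("", without_urls)


print(repr(clean_tweet("Support #FarmersProtest https://t.co/abc123")))
```

Because `#` and `@` are removed along with punctuation, hashtags and mentions survive only as plain words, and any emoji or accented letters are dropped entirely.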

Now, we'll convert them to Hugging Face `Dataset` objects.

In [18]:
! pip install transformers datasets

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Download

In [19]:
from datasets import Dataset

womenmarch = Dataset.from_pandas(pd.DataFrame(data=womenmarch))
farmers = Dataset.from_pandas(pd.DataFrame(data=farmers))

In [20]:
print(womenmarch["text"][0])

So far I think 2017 has been the year of the women letlizspeak SallyYates WomensMarch melissamccarthy


In [21]:
print(farmers["renderedContent"][0])

Support 

FarmersProtest


Great! Now, we can move on to the next step: fine-tuning our model.