# <font color='orange'>  <center> NLP Project: ML Algorithms on Text Data

## <font color='orange'> Each student must extract tweets from Twitter, perform pre-processing and text representation, and apply ML algorithms for classification/clustering. <br> <br> 1. Creating Datasets <dd> a. Extract 5000 tweets using any 5 search labels of your choice (1000 each), e.g. #cricket, #football, #basketball, #tennis, #hockey. <dd> b. Create one dataset of all the extracted tweets, with the label as the second column. Shuffle the dataset. </dd> <br> <br>2. Pre-processing <dd> a. Clean the data by removing tags, user handles, numbers, and other characters. <dd> b. Stem tokens for basic vectorization <dd> c. Lemmatize tokens for embeddings </dd> <br> <br> 3. Text representation <dd> a. Vectorise each document in the dataset using tf-idf with n-grams (use the stemmed data). <dd> b. Create document embeddings by summing word vectors taken from any two pre-trained models. The tokens must be lemmas. </dd> <br> <br> 4. Apply machine learning techniques (any two algorithms) for classification/clustering on <dd> a. 3.a data <dd> b. 3.b data </dd> <br> <br> 5. Evaluate the results (4.a and 4.b) and state which outperforms the other. <dd> a. For clustering, compare the labels of at least 10 records with the clusters created. <dd> b. For classification, present a chart like the following:

![image.png](attachment:61d3b866-e6a7-4e77-a22d-18e42865dea4.png)
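
Step 2a of the brief (removing tags, user handles, numbers, and other characters) could be sketched with regular expressions as below. The helper name `clean_tweet`, the patterns, and the sample string are assumptions made for illustration; in particular, this sketch keeps the word behind a hashtag and drops only the `#`, which is one possible reading of "removing tags".

```python
import html
import re

# Illustrative patterns (assumptions, not taken from the notebook).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Hypothetical helper: lowercase a tweet and strip URLs, handles, digits and symbols."""
    text = html.unescape(text)            # turn "&amp;" back into "&"
    text = URL_RE.sub(" ", text)          # drop links first so their letters do not leak through
    text = HANDLE_RE.sub(" ", text)       # drop @user mentions
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)   # drop '#', digits, punctuation and emoji
    return " ".join(text.split())         # collapse runs of whitespace


# Made-up sample tweet
print(clean_tweet("Two Stages of Sleep. #loki #cat https://example.com/x @someone &amp; 2 more"))
# two stages of sleep loki cat more
```

Links are removed before the generic symbol filter on purpose: otherwise the letters inside a URL would survive as noise tokens.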

In [1]:
import twitter_info
import tweepy
import time
import json
import pandas as pd
from datetime import datetime

In [2]:
consumer_key = twitter_info.API_Key
consumer_secret = twitter_info.API_Key_Secret
access_token = twitter_info.Access_Token
access_token_secret = twitter_info.Access_Token_Secret
bearer_token = twitter_info.Bearer_Token

In [3]:
client = tweepy.Client(bearer_token = bearer_token)
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
API = tweepy.API(auth, wait_on_rate_limit=True)

In [4]:
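def get_response(query):
    # Cursor is lazy: requests are only sent while iterating, so API errors
    # raised mid-iteration are not caught here. Rate limits are already
    # handled by wait_on_rate_limit=True on the API object.
    try:
        r = tweepy.Cursor(API.search_tweets, q=query, tweet_mode="extended").items()
    except tweepy.TweepyException as e:  # tweepy 4.x replaced TweepError with TweepyException
        print("Exception Caught: ", e)
        print("Sleeping for 16 minutes")
        time.sleep(960)
        r = tweepy.Cursor(API.search_tweets, q=query, tweet_mode="extended").items()
    return r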
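
Because `tweepy.Cursor` sends requests only while it is being iterated, errors surface in the loop that consumes it, not where the cursor is built. One way to handle that is to wrap the iterator itself, as in the minimal sketch below. The names `iterate_with_retry`, `max_retries`, `wait_seconds`, and `sleep` are hypothetical, and the sketch assumes the wrapped iterator can be resumed after `next()` raises. That holds for class-based iterators but not for plain generators, and it has not been verified here for tweepy's own iterators.

```python
import time


def iterate_with_retry(iterator, exceptions=(Exception,), max_retries=3,
                       wait_seconds=960, sleep=time.sleep):
    """Yield items from `iterator`; when next() raises, sleep and try again.

    Hypothetical helper: gives up after `max_retries` consecutive failures.
    """
    retries = 0
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            return                      # the source is exhausted
        except exceptions as e:
            retries += 1
            if retries > max_retries:
                raise                   # too many failures in a row
            print("Exception caught:", e)
            sleep(wait_seconds)
            continue
        retries = 0                     # a success resets the failure counter
        yield item
```

With the notebook's objects, usage might look like `for response in iterate_with_retry(get_response(query), exceptions=(tweepy.TweepyException,)): ...`.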

In [5]:
# Timestamp used in output file names, formatted as YYYY-MM-DD_HHMM
s = datetime.now().strftime("%Y-%m-%d_%H%M")

In [6]:
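def process_response(class_name, class_file_name):
    """Collect tweets for one class and checkpoint them to CSV every 100 tweets."""
    global tweets_response_list, df
    columns = ['id_str', 'full_text', 'user_name', 'user_screen_name', 'user_location',
               'created_at', 'retweet_count', 'favorite_count', 'hashtags']
    out_path = "output/nlp_tweet_" + class_file_name + "_" + s + ".csv"
    tweets_response_list = []
    query = class_name + " lang:en -filter:retweets"
    for response in get_response(query):
        hashtags = [h['text'] for h in response.entities['hashtags']]
        # Note: 'created_at' holds the account creation time (response.user.created_at),
        # not the time the tweet was posted.
        tweets_response_list.append([response.id_str, response.full_text, response.user.name, response.user.screen_name, response.user.location,
                                     response.user.created_at, response._json['retweet_count'], response._json['favorite_count'], hashtags])
        # Rebuild and save the DataFrame only at checkpoints, not on every tweet.
        if len(tweets_response_list) % 100 == 0:
            print(len(tweets_response_list), end=" ")
            df = pd.DataFrame(tweets_response_list, columns=columns)
            df.to_csv(out_path, index=False)
    # Save once more so the final partial batch is not lost.
    df = pd.DataFrame(tweets_response_list, columns=columns)
    df.to_csv(out_path, index=False)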

In [22]:
class_list = ['Star Wars','Marvel Cinematic Universe MCU','MonsterVerse','Wizarding World','DC Extended Universe']

def load_query(path):
    """Read one keyword per line and join the keywords with ' OR '."""
    with open(path) as f:
        return ' OR '.join(key_word.strip() for key_word in f)

class_1 = load_query('input/star_wars_keywords.txt')
print(class_1)

class_2 = load_query('input/Marvel_Cinematic_Universe_MCU_keywords.txt')
print(class_2)

class_3 = load_query('input/MonsterVerse_keywords.txt')
print(class_3)

Star Wars OR #StarWars OR #DarthVader OR #LukeSkywalker OR #Obi-WanKenobi OR #HanSolo OR #Yoda OR #PrincessLeia OR #R2-D2 OR #Chewbacca OR #DarthMaul OR #C-3PO
MarvelCinematicUniverse OR #MCU OR #Thor OR #AntMan OR #Falcon OR #Hulk OR #LukeCage OR #DoctorStrange OR #Groot OR #Punisher OR #BlackWidow OR #JessicaJones OR #Daredevil OR #CaptainAmerica OR #Loki OR #BlackPanther OR #SpiderMan OR #IronMan OR #ScarletWitch OR #CaptainMarvel OR #Hawkeye OR #StarLord OR #Thanos OR #DeadPool OR #XMen OR #Eternals
MonsterVerse OR King AND Kong OR Anguirus OR Godzilla OR Mecha AND King AND Ghidorah OR Mecha AND godzilla OR King AND Ghidorah OR MUTO OR MonsterX OR Mothra OR Preston AND Packard OR Alan AND Jonah OR Emma AND Russell OR Mark AND Russell OR James AND Conrad OR Mason AND Weaver OR Ford AND Brody OR Madison AND Russell OR Joe AND Brody OR Ishiro AND Serizawa
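
The query strings printed above are keyword lists joined with `OR`. The sketch below shows the same idea with two additions: blank lines are skipped, and multi-word keywords are wrapped in double quotes so that they are searched as an exact phrase. The function name `build_query` is hypothetical, and the quoting is a suggestion only; it is not what the notebook's keyword files do, since they spell out `AND` between words.

```python
def build_query(keywords):
    """Hypothetical helper: join keywords with OR, quoting multi-word phrases."""
    terms = []
    for kw in keywords:
        kw = kw.strip()
        if not kw:
            continue                                    # skip blank lines
        terms.append(f'"{kw}"' if " " in kw else kw)    # quote phrases only
    return " OR ".join(terms)


# Made-up keyword lines, as they would come from readlines()
print(build_query(["Star Wars\n", "#StarWars\n", "\n", "#Yoda"]))
# "Star Wars" OR #StarWars OR #Yoda
```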


In [23]:
process_response(class_3, class_list[2])

In [24]:
df = pd.DataFrame(tweets_response_list,
                  columns=['id_str', 'full_text', 'user_name', 'user_screen_name', 'user_location',
                           'created_at', 'retweet_count', 'favorite_count', 'hashtags'])

In [25]:
df

Unnamed: 0,id_str,full_text,user_name,user_screen_name,user_location,created_at,retweet_count,favorite_count,hashtags


In [11]:
len(df['full_text'].unique())

19261

In [35]:
def process_page(page):
    """Append one page of results and checkpoint to CSV every 100 tweets."""
    global tweets_response_list, df, count
    columns = ['id_str', 'full_text', 'user_name', 'user_screen_name', 'user_location',
               'created_at', 'retweet_count', 'favorite_count', 'hashtags']
    for response in page:
        hashtags = [h['text'] for h in response.entities['hashtags']]
        tweets_response_list.append([response.id_str, response.full_text, response.user.name, response.user.screen_name, response.user.location,
                                     response.user.created_at, response._json['retweet_count'], response._json['favorite_count'], hashtags])
        count = count + 1
        # Rebuild and save the DataFrame only at checkpoints, not on every tweet.
        if len(tweets_response_list) % 100 == 0:
            print(count, end=" ")
            df = pd.DataFrame(tweets_response_list, columns=columns)
            df.to_csv("output/nlp_tweet_MCU_"+s+".csv", index=False)
    # Keep df current even when the page ends between checkpoints.
    df = pd.DataFrame(tweets_response_list, columns=columns)

In [36]:
page_list = []
count = 0
tweets_response_list = []
# Up to 500 pages of 100 tweets each; English only, retweets excluded.
# (A `global` statement is not needed at module level, so it is omitted.)
query = class_2 + " lang:en -filter:retweets"
for page in tweepy.Cursor(API.search_tweets, q=query, count=100, tweet_mode='extended').pages(500):
    page_list.append(page)
    process_page(page)

100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700 4800 4900 5000 5100 5200 5300 5400 5500 5600 5700 5800 5900 6000 6100 6200 6300 6400 6500 6600 6700 6800 6900 7000 7100 7200 7300 7400 7500 7600 7700 7800 7900 8000 8100 8200 8300 8400 8500 8600 8700 8800 8900 9000 9100 9200 9300 9400 9500 9600 9700 9800 9900 10000 10100 10200 10300 10400 10500 10600 10700 10800 10900 11000 11100 11200 11300 11400 11500 11600 11700 11800 11900 12000 12100 12200 12300 12400 12500 12600 12700 12800 12900 13000 13100 13200 13300 13400 13500 13600 13700 13800 13900 14000 14100 14200 14300 14400 14500 14600 14700 14800 14900 15000 15100 15200 15300 15400 15500 15600 15700 15800 15900 16000 16100 16200 16300 16400 16500 16600 16700 16800 16900 17000 17100 17200 17300 17400 17500 17600 17700 17800 17900 18000 18100 18200 18300 18400 1850

In [30]:
len(page_list[0])

100

In [34]:
df

Unnamed: 0,id_str,full_text,user_name,user_screen_name,user_location,created_at,retweet_count,favorite_count,hashtags
0,1507794528903876620,Marvel’s Newest Project Introduces MCU’s Green...,Inside the Magic,InsideTheMagic,Theme Parks... and beyond.,2008-03-26 18:22:40+00:00,0,0,"[Nova, GreenLantern, MCU, MarvelStudios, RyanR..."
1,1507794526219427843,Astrid Bloom commission\n\n#xmen #astridbloom ...,Sergei Titov (Commissions are closed),ArtSergeiTitov,"Tbilisi, Georgia",2014-11-05 15:04:56+00:00,0,0,"[xmen, astridbloom, emmafrost, marvel, art, co..."
2,1507794490727219204,"Obsessed with this art by @LoreDeFelici, so sk...",NovaMCU,NovaMCU,,2018-12-18 19:48:56+00:00,0,0,[SpiderMan]
3,1507794359307091973,Two Stages of Sleep. #loki #cat https://t.co/y...,Chris Konrath 💙,chriskonrath,"Lincoln, England",2010-02-14 11:51:02+00:00,0,0,"[loki, cat]"
4,1507794310858637313,Welcome To Organic @Spotify #Music Promotion S...,Rajib Hasan,RajibHa77770218,"Sylhet, Bangladesh",2022-02-27 08:52:37+00:00,0,0,"[Music, Spotify, Wizkid, BTSJIMIN, posiciónalo..."
...,...,...,...,...,...,...,...,...,...
995,1507653038474944519,Quality #comicbooks!! #CGC graded #comics &amp...,Langsyne Comics,langsyneC,"Florida, USA",2019-09-21 13:56:29+00:00,1,0,"[comicbooks, CGC, comics, Marvel, dccomics, Ma..."
996,1507652796178477060,This is perfect. \n\nThere hasn't been a momen...,The Coastal Bend Spider-man,CB_Spidey,"Texas, USA",2018-06-14 04:30:29+00:00,0,1,[spiderman]
997,1507652699869020167,Who did it better? \n#RRRMovie #NTR𓃵 \n@tarak9...,KALUS SHELBY,THOMASMIKEALSON,India,2020-10-19 10:06:48+00:00,0,0,"[RRRMovie, NTR𓃵, KomaramBheemudo, KomaramBheem..."
998,1507651962757656585,"In the Age of Apocalypse, the government team ...",Charlie E/N,charlie_en,"South East, England",2010-06-22 17:44:29+00:00,0,2,"[dailyxmen, xmen]"
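
Once a frame like this exists for each class, step 1b of the brief (one dataset, label as the second column, shuffled) could be sketched as follows. The helper name `combine_and_shuffle`, the `seed` parameter, and the tiny demo frames are assumptions for illustration. Dropping duplicate `full_text` rows is an extra step added here, not something the brief asks for.

```python
import pandas as pd


def combine_and_shuffle(frames_by_label, seed=42):
    """Hypothetical helper: stack per-class frames, label them, dedupe and shuffle."""
    labelled = [frame.assign(label=label) for label, frame in frames_by_label.items()]
    combined = pd.concat(labelled, ignore_index=True)
    combined = combined.drop_duplicates(subset="full_text")   # optional extra step
    combined = combined[["full_text", "label"]]               # label as second column
    # sample(frac=1) returns every row in random order; the seed makes it repeatable
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)


# Made-up stand-ins for the per-class DataFrames
demo = combine_and_shuffle({
    "MCU": pd.DataFrame({"full_text": ["a", "b", "b"]}),
    "Star Wars": pd.DataFrame({"full_text": ["c", "d"]}),
})
print(demo)
```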
