In [147]:
# -------------------------------------------------------
# DAY 1: NLP BASICS + SENTIMENT ANALYSIS MODEL
# -------------------------------------------------------
# Notebook by: <your name>
# Goal: Learn NLP basics and build a sentiment analysis model
# -------------------------------------------------------

In [150]:
# NLP Summary
#- Definition: NLP stands for Natural Language Processing, a branch of artificial intelligence (closely related to data science and linguistics) that enables machines to understand and interpret human language.
#- What NLP can do: Tasks such as speech recognition, language translation, chatbots and virtual assistants, text summarization, information retrieval, and sentiment analysis.
#- Real-world examples: Everyday tools such as Google Search, Siri, Google Translate, and ChatGPT.

In [152]:
### Preprocessing Summary
#- Tokenization: A major step in text preprocessing that splits text into tokens such as words or sentences.
#- Stopwords: Very common words (e.g. "the", "is", "and") that contribute little to the meaning of a text. They mostly add noise, so they are usually removed.
#- Stemming vs Lemmatization: Both reduce words to a base/root form. Stemming strips suffixes by rule without considering meaning or context, so the result may not be a real word.
#  For example, "crazy" is stemmed to "crazi", which is not a valid word, whereas lemmatization uses a vocabulary of known base forms to map the different forms of a word to its lemma, which preserves the meaning.
#- Bag of Words: A feature-extraction method that treats each document as an unordered bag of words and converts it to a vector holding the count (frequency) of each vocabulary word.
#- TF-IDF: Unlike BoW, which weights all words equally, TF-IDF accounts for words that appear in many documents simply because of the corpus topic. It computes each term's document frequency (the number of documents containing the term) and multiplies the term frequency by the inverse document frequency, so terms that occur in most documents receive lower weights.
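
The Bag of Words and TF-IDF ideas above can be sketched by hand on a tiny corpus. This is a minimal illustration, not how the notebook computes its features: the three documents are made up, and the IDF used here is the textbook `log(N / df)` form. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from movie_reviews)
docs = [
    "good movie good plot",
    "bad movie",
    "good acting bad plot",
]
tokenized = [d.split() for d in docs]

# Bag of Words: one count vector per document over a shared vocabulary
vocab = sorted({w for doc in tokenized for w in doc})
bow = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Document frequency: number of documents that contain each word
df_counts = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Textbook IDF: log(N / df); a word present in every document would get 0
n_docs = len(tokenized)
idf = {w: math.log(n_docs / df_counts[w]) for w in vocab}

# TF-IDF: raw term count multiplied by IDF
tfidf = [[count * idf[w] for count, w in zip(row, vocab)] for row in bow]

print(vocab)                            # ['acting', 'bad', 'good', 'movie', 'plot']
print(bow[0])                           # [0, 0, 2, 1, 1]
print([round(v, 3) for v in tfidf[0]])  # [0.0, 0.0, 0.811, 0.405, 0.405]
```

In this toy corpus "acting" occurs in only one document, so it gets a higher IDF than "good", which occurs in two.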

In [155]:
#Data Loading
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\kunar\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [156]:
movie_reviews.categories()

['neg', 'pos']

In [157]:
len(movie_reviews.fileids('pos')), len(movie_reviews.fileids('neg'))

(1000, 1000)

In [158]:
import pandas as pd

data = []

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        text = movie_reviews.raw(fileid)
        data.append((text, category))

df = pd.DataFrame(data, columns=['text', 'label'])
df.head()

Unnamed: 0,text,label
0,"plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat's the deal ? \nwatch the movie and "" sorta "" find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it's simply too jumbled . \nit starts off "" normal "" but then downshifts into this "" fantasy "" world in which you , as an audience member , have no idea what's going on . \nthere are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . \nnow i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem . \nit's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . \nand do they make things entertaining , thrilling or even engaging , in the meantime ? \nnot really . 
\nthe sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining . \ni guess the bottom line with movies like this is that you should always make sure that the audience is "" into it "" even before they are given the secret password to enter your world of understanding . \ni mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! \nokay , we get it . . . there \nare people chasing her and we don't know who they are . \ndo we really need to see it over and over again ? \nhow about giving us different scenes offering further insight into all of the strangeness going down in the movie ? \napparently , the studio took this film away from its director and chopped it up themselves , and it shows . \nthere might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess "" the suits "" decided that turning it into a music video with little edge , would make more sense . \nthe actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . \nbut my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling . \noverall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . \noh , and by the way , this is not a horror or teen slasher flick . . . it's \njust packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . 
\nit also wrapped production two years ago and has been sitting on the shelves ever since . \nwhatever . . . skip \nit ! \nwhere's joblo coming from ? \na nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) \n",neg
1,"the happy bastard's quick movie review \ndamn that y2k bug . \nit's got a head start in this movie starring jamie lee curtis and another baldwin brother ( william this time ) in a story regarding a crew of a tugboat that comes across a deserted russian tech ship that has a strangeness to it when they kick the power back on . \nlittle do they know the power within . . . \ngoing for the gore and bringing on a few action sequences here and there , virus still feels very empty , like a movie going for all flash and no substance . \nwe don't know why the crew was really out in the middle of nowhere , we don't know the origin of what took over the ship ( just that a big pink flashy thing hit the mir ) , and , of course , we don't know why donald sutherland is stumbling around drunkenly throughout . \nhere , it's just "" hey , let's chase these people around with some robots "" . \nthe acting is below average , even from the likes of curtis . \nyou're more likely to get a kick out of her work in halloween h20 . \nsutherland is wasted and baldwin , well , he's acting like a baldwin , of course . \nthe real star here are stan winston's robot design , some schnazzy cgi , and the occasional good gore shot , like picking into someone's brain . \nso , if robots and body parts really turn you on , here's your movie . \notherwise , it's pretty much a sunken ship of a movie . \n",neg
2,"it is movies like these that make a jaded movie viewer thankful for the invention of the timex indiglo watch . \nbased on the late 1960's television show by the same name , the mod squad tells the tale of three reformed criminals under the employ of the police to go undercover . \nhowever , things go wrong as evidence gets stolen and they are immediately under suspicion . \nof course , the ads make it seem like so much more . \nquick cuts , cool music , claire dane's nice hair and cute outfits , car chases , stuff blowing up , and the like . \nsounds like a cool movie , does it not ? \nafter the first fifteen minutes , it quickly becomes apparent that it is not . \nthe mod squad is certainly a slick looking production , complete with nice hair and costumes , but that simply isn't enough . \nthe film is best described as a cross between an hour-long cop show and a music video , both stretched out into the span of an hour and a half . \nand with it comes every single clich ? . \nit doesn't really matter that the film is based on a television show , as most of the plot elements have been recycled from everything we've already seen . \nthe characters and acting is nothing spectacular , sometimes even bordering on wooden . \nclaire danes and omar epps deliver their lines as if they are bored , which really transfers onto the audience . \nthe only one to escape relatively unscathed is giovanni ribisi , who plays the resident crazy man , ultimately being the only thing worth watching . \nunfortunately , even he's not enough to save this convoluted mess , as all the characters don't do much apart from occupying screen time . \nwith the young cast , cool clothes , nice hair , and hip soundtrack , it appears that the film is geared towards the teenage mindset . \ndespite an american 'r' rating ( which the content does not justify ) , the film is way too juvenile for the older mindset . 
\ninformation on the characters is literally spoon-fed to the audience ( would it be that hard to show us instead of telling us ? ) , dialogue is poorly written , and the plot is extremely predictable . \nthe way the film progresses , you likely won't even care if the heroes are in any jeopardy , because you'll know they aren't . \nbasing the show on a 1960's television show that nobody remembers is of questionable wisdom , especially when one considers the target audience and the fact that the number of memorable films based on television shows can be counted on one hand ( even one that's missing a finger or two ) . \nthe number of times that i checked my watch ( six ) is a clear indication that this film is not one of them . \nit is clear that the film is nothing more than an attempt to cash in on the teenage spending dollar , judging from the rash of really awful teen-flicks that we've been seeing as of late . \navoid this film at all costs . \n",neg
3,""" quest for camelot "" is warner bros . ' first feature-length , fully-animated attempt to steal clout from disney's cartoon empire , but the mouse has no reason to be worried . \nthe only other recent challenger to their throne was last fall's promising , if flawed , 20th century fox production "" anastasia , "" but disney's "" hercules , "" with its lively cast and colorful palate , had her beat hands-down when it came time to crown 1997's best piece of animation . \nthis year , it's no contest , as "" quest for camelot "" is pretty much dead on arrival . \neven the magic kingdom at its most mediocre -- that'd be "" pocahontas "" for those of you keeping score -- isn't nearly as dull as this . \nthe story revolves around the adventures of free-spirited kayley ( voiced by jessalyn gilsig ) , the early-teen daughter of a belated knight from king arthur's round table . \nkayley's only dream is to follow in her father's footsteps , and she gets her chance when evil warlord ruber ( gary oldman ) , an ex-round table member-gone-bad , steals arthur's magical sword excalibur and accidentally loses it in a dangerous , booby-trapped forest . \nwith the help of hunky , blind timberland-dweller garrett ( carey elwes ) and a two-headed dragon ( eric idle and don rickles ) that's always arguing with itself , kayley just might be able to break the medieval sexist mold and prove her worth as a fighter on arthur's side . \n "" quest for camelot "" is missing pure showmanship , an essential element if it's ever expected to climb to the high ranks of disney . \nthere's nothing here that differentiates "" quest "" from something you'd see on any given saturday morning cartoon -- subpar animation , instantly forgettable songs , poorly-integrated computerized footage . \n ( compare kayley and garrett's run-in with the angry ogre to herc's battle with the hydra . \ni rest my case . 
) \neven the characters stink -- none of them are remotely interesting , so much that the film becomes a race to see which one can out-bland the others . \nin the end , it's a tie -- they all win . \nthat dragon's comedy shtick is awfully cloying , but at least it shows signs of a pulse . \nat least fans of the early-'90s tgif television line-up will be thrilled to find jaleel "" urkel "" white and bronson "" balki "" pinchot sharing the same footage . \na few scenes are nicely realized ( though i'm at a loss to recall enough to be specific ) , and the actors providing the voice talent are enthusiastic ( though most are paired up with singers who don't sound a thing like them for their big musical moments -- jane seymour and celine dion ? ? ? ) . \nbut one must strain through too much of this mess to find the good . \naside from the fact that children will probably be as bored watching this as adults , "" quest for camelot "" 's most grievous error is its complete lack of personality . \nand personality , we learn from this mess , goes a very long way . \n",neg
4,"synopsis : a mentally unstable man undergoing psychotherapy saves a boy from a potentially fatal accident and then falls in love with the boy's mother , a fledgling restauranteur . \nunsuccessfully attempting to gain the woman's favor , he takes pictures of her and kills a number of people in his way . \ncomments : stalked is yet another in a seemingly endless string of spurned-psychos-getting-their-revenge type movies which are a stable category in the 1990s film industry , both theatrical and direct-to-video . \ntheir proliferation may be due in part to the fact that they're typically inexpensive to produce ( no special effects , no big name stars ) and serve as vehicles to flash nudity ( allowing them to frequent late-night cable television ) . \nstalked wavers slightly from the norm in one respect : the psycho never actually has an affair ; on the contrary , he's rejected rather quickly ( the psycho typically is an ex-lover , ex-wife , or ex-husband ) . \nother than that , stalked is just another redundant entry doomed to collect dust on video shelves and viewed after midnight on cable . \nstalked does not provide much suspense , though that is what it sets out to do . \ninterspersed throughout the opening credits , for instance , a serious-sounding narrator spouts statistics about stalkers and ponders what may cause a man to stalk ( it's implicitly implied that all stalkers are men ) while pictures of a boy are shown on the screen . \nafter these credits , a snapshot of actor jay underwood appears . \nthe narrator states that "" this is the story of daryl gleason "" and tells the audience that he is the stalker . \nof course , really , this is the story of restauranteur brooke daniels . \nif the movie was meant to be about daryl , then it should have been called stalker not stalked . \nokay . so we know who the stalker is even before the movie starts ; no guesswork required . \nstalked proceeds , then , as it begins : obvious , obvious , obvious . 
\nthe opening sequence , contrived quite a bit , brings daryl and brooke ( the victim ) together . \ndaryl obsesses over brooke , follows her around , and tries to woo her . \nultimately rejected by her , his plans become more and more desperate and elaborate . \nthese plans include the all-time , psycho-in-love , cliche : the murdered pet . \nfor some reason , this genre's films require a dead pet to be found by the victim stalked . \nstalked is no exception ( it's a cat this time -- found in the shower ) . \nevents like these lead to the inevitable showdown between stalker and stalked , where only one survives ( guess who it invariably always is and you'll guess the conclusion to this turkey ) . \nstalked's cast is uniformly adequate : not anything to write home about but also not all that bad either . \njay underwood , as the stalker , turns toward melodrama a bit too much . \nhe overdoes it , in other words , but he still manages to be creepy enough to pass as the type of stalker the story demands . \nmaryam d'abo , about the only actor close to being a star here ( she played the bond chick in the living daylights ) , is equally adequate as the "" stalked "" of the title , even though she seems too ditzy at times to be a strong , independent business-owner . \nbrooke ( d'abo ) needs to be ditzy , however , for the plot to proceed . \ntoward the end , for example , brooke has her suspicions about daryl . \nto ensure he won't use it as another excuse to see her , brooke decides to return a toolbox he had left at her place to his house . \ndoes she just leave the toolbox at the door when no one answers ? \nof course not . \nshe tries the door , opens it , and wanders around the house . \nwhen daryl returns , he enters the house , of course , so our heroine is in danger . \nsomehow , even though her car is parked at the front of the house , right by the front door , daryl is oblivious to her presence inside . 
\nthe whole episode places an incredible strain on the audience's suspension of disbelief and questions the validity of either character's intelligence . \nstalked receives two stars because , even though it is highly derivative and somewhat boring , it is not so bad that it cannot be watched . \nrated r mostly for several murder scenes and brief nudity in a strip bar , it is not as offensive as many other thrillers in this genre are . \nif you're in the mood for a good suspense film , though , stake out something else . \n",neg


In [159]:
df['label'] = df['label'].map({'neg': 0, 'pos': 1})

In [160]:
df.shape

(2000, 2)

In [161]:
#Text Preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

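def clean_text(text):
    # Replace every non-letter character (digits, punctuation) with a space
    text = re.sub(r"[^a-zA-Z]", " ", text)
    text = text.lower()
    # Whitespace tokenization
    tokens = text.split()
    # Drop English stopwords (NLTK's list includes negations such as "not")
    tokens = [w for w in tokens if w not in stop]
    # Reduce each token to its lemma (treated as a noun by default)
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return " ".join(tokens)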

df['clean_text'] = df['text'].apply(clean_text)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kunar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kunar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
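
One side effect of stopword removal worth checking for sentiment tasks: NLTK's English stopword list contains negations such as "not" and "no", so phrases like "not bad" lose their negation. The sketch below shows the effect with a simplified cleaner. `simple_clean`, `TOY_STOP`, and `NEGATIONS` are hypothetical names introduced only for illustration, the stopword set is a tiny made-up subset, and lemmatization is omitted to keep the example dependency-free.

```python
import re

# Hypothetical mini stopword list; NLTK's English list is much longer
# and also contains negations such as "not" and "no".
TOY_STOP = {"this", "was", "the", "but", "either", "not", "no"}
NEGATIONS = {"not", "no"}

def simple_clean(text, stop_words):
    """Lowercase, keep letters only, split on whitespace, drop stop_words."""
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    return " ".join(w for w in text.split() if w not in stop_words)

sentence = "Not bad but not great either."
print(simple_clean(sentence, TOY_STOP))              # bad great
print(simple_clean(sentence, TOY_STOP - NEGATIONS))  # not bad not great
```

Keeping negations out of the stopword set (the second call) is one possible design choice. Whether it improves accuracy on this dataset would need to be measured.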


In [162]:
#Feature Extraction
from sklearn.feature_extraction.text import TfidfVectorizer

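# Keep the 5000 most frequent unigrams and bigrams across the corpus.
# Note: fitting on all of df before the train/test split lets test-set
# vocabulary and IDF statistics leak into the features.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))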
X = vectorizer.fit_transform(df['clean_text'])
y = df['label']

In [163]:
#Modelling
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
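
Because the vectorizer in this notebook is fitted on the whole dataset before the split, the test rows influence the vocabulary and IDF weights. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so the vectorizer only sees training text. The sketch below uses made-up toy data and hypothetical variable names. In the notebook the inputs would be `df['clean_text']` and `df['label']`, and the reported accuracy could change slightly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data (hypothetical), standing in for df['clean_text'] and df['label']
toy_texts = [
    "great film wonderful cast", "loved every minute",
    "brilliant funny script", "great acting great story",
    "awful boring plot", "terrible waste time",
    "dull awful acting", "boring terrible script",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the class ratio in both parts
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training text only, then trains the SVM
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
pipeline.fit(texts_train, labels_train)

# At prediction time the same fitted vocabulary is reused
toy_pred = pipeline.predict(texts_test)
print(len(texts_train), len(texts_test))  # 6 2
```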

In [164]:
from sklearn.svm import LinearSVC
model = LinearSVC()
model.fit(X_train, y_train)

LinearSVC(penalty='l2', loss='squared_hinge', dual='auto', tol=0.0001, C=1.0,
          multi_class='ovr', fit_intercept=True, intercept_scaling=1,
          class_weight=None, verbose=0)


In [165]:
#Evaluation
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.845
              precision    recall  f1-score   support

           0       0.86      0.82      0.84       199
           1       0.83      0.87      0.85       201

    accuracy                           0.84       400
   macro avg       0.85      0.84      0.84       400
weighted avg       0.85      0.84      0.84       400
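
To make the report columns concrete: an accuracy of 0.845 on 400 test reviews corresponds to 338 correct predictions. Precision, recall, and F1 for one class can be computed by hand as in the sketch below. `binary_report`, `demo_true`, and `demo_pred` are hypothetical names and the labels are made up. This only illustrates the definitions and is not a replacement for `classification_report`.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels: 3 true positives, 2 false positives, 1 false negative
demo_true = [1, 1, 1, 0, 0, 0, 1, 0]
demo_pred = [1, 0, 1, 0, 1, 0, 1, 1]

precision, recall, f1 = binary_report(demo_true, demo_pred)
accuracy = sum(t == p for t, p in zip(demo_true, demo_pred)) / len(demo_true)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.75 0.667
print(accuracy)                                             # 0.625
```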



In [166]:
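#Prediction (1 = positive, 0 = negative)
# Note: these raw strings bypass clean_text, so they are vectorized
# without the stopword removal and lemmatization used for training.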
print(model.predict(vectorizer.transform([
    "I loved this movie!",
    "This was the worst movie ever.",
    "Not bad but not great either."
])))

[1 0 0]


In [167]:
#Another Prediction
test = [
    "good", "bad", "love", "hate", "terrible", 
    "amazing", "this is bad", "this is good"
]

print(model.predict(vectorizer.transform(test)))


[1 0 1 1 0 1 0 1]
