# Optimizing Grid Search

Grid search gets expensive quickly as the search space grows, so it is worth looking at ways to speed it up. We'll demonstrate this with the GoEmotions dataset, which consists of 58k Reddit comments annotated with 27 emotion categories.

In [1]:
# Download Data
! wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv

--2024-03-26 10:02:12--  https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 216.58.213.91, 216.58.214.187, 216.58.215.59, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.213.91|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14174600 (14M) [application/octet-stream]
Saving to: ‘data/full_dataset/goemotions_1.csv.1’


2024-03-26 10:02:13 (15,6 MB/s) - ‘data/full_dataset/goemotions_1.csv.1’ saved [14174600/14174600]



In [1]:
import pandas as pd

In [5]:
# Path relative to the notebook directory, matching the `wget -P` target above
df = pd.read_csv("data/full_dataset/goemotions_1.csv")

In [6]:
df.head()

,text,id,author,subreddit,link_id,parent_id,created_utc,rater_id,example_very_unclear,admiration,...,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,That game hurt.,eew5j0j,Brdd9,nrl,t3_ajis4z,t1_eew18eq,1548381000.0,1,False,0,...,0,0,0,0,0,0,0,1,0,0
1,>sexuality shouldn’t be a grouping category I...,eemcysk,TheGreen888,unpopularopinion,t3_ai4q37,t3_ai4q37,1548084000.0,37,True,0,...,0,0,0,0,0,0,0,0,0,0
2,"You do right, if you don't care then fuck 'em!",ed2mah1,Labalool,confessions,t3_abru74,t1_ed2m7g7,1546428000.0,37,False,0,...,0,0,0,0,0,0,0,0,0,1
3,Man I love reddit.,eeibobj,MrsRobertshaw,facepalm,t3_ahulml,t3_ahulml,1547965000.0,18,False,0,...,1,0,0,0,0,0,0,0,0,0
4,"[NAME] was nowhere near them, he was by the Fa...",eda6yn6,American_Fascist713,starwarsspeculation,t3_ackt2f,t1_eda65q2,1546669000.0,2,False,0,...,0,0,0,0,0,0,0,0,0,1


In [8]:
df[['text', 'surprise']].sample(3)

,text,surprise
37474,I love the way they make my skin feels after I...,0
37334,"Lost my kid to the system, wanna see wha the w...",0
2042,> lots of ppl dont enjoy pvp I feel like these...,0


In [9]:
X = df['text']
y = df['surprise']

In [10]:
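import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline, make_union
from sklearn.linear_model import LogisticRegression

# TF-IDF features -> low-rank SVD projection -> logistic regression.
# `memory` caches the fitted transformers on disk in the "cache_demo" directory,
# so an unchanged transformer step is not refit on the same data.
pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=10),
    LogisticRegression(C=0.1),
    memory="cache_demo"
)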
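
To make the effect of `memory` concrete, here is a minimal sketch on a tiny made-up corpus (not GoEmotions). The corpus, the labels, the temporary cache directory and the name `cached_pipe` are assumptions for illustration only. The second `fit` changes only the classifier's `C`, so the two transformer steps should be reusable from the cache rather than recomputed.

```python
import tempfile

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus, only for illustration.
texts = [
    "what a surprise", "that was unexpected", "wow did not see that coming",
    "nothing new today", "same as always", "a very ordinary day",
] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

with tempfile.TemporaryDirectory() as cache_dir:
    cached_pipe = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=2, random_state=0),
        LogisticRegression(C=0.1),
        memory=cache_dir,  # fitted transformers are stored here
    )
    cached_pipe.fit(texts, labels)

    # Only the final estimator changes, so TF-IDF and SVD see the same
    # parameters and the same input as in the first fit.
    cached_pipe.set_params(logisticregression__C=10.0)
    cached_pipe.fit(texts, labels)
    predictions = cached_pipe.predict(texts)
```

The last step of a pipeline is never cached, so the logistic regression is always refit; the saving comes from the transformers in front of it.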

In [11]:
pipe.get_params()

{'memory': 'cache_demo',
 'steps': [('tfidfvectorizer', TfidfVectorizer()),
  ('truncatedsvd', TruncatedSVD(n_components=10)),
  ('logisticregression', LogisticRegression(C=0.1))],
 'verbose': False,
 'tfidfvectorizer': TfidfVectorizer(),
 'truncatedsvd': TruncatedSVD(n_components=10),
 'logisticregression': LogisticRegression(C=0.1),
 'tfidfvectorizer__analyzer': 'word',
 'tfidfvectorizer__binary': False,
 'tfidfvectorizer__decode_error': 'strict',
 'tfidfvectorizer__dtype': numpy.float64,
 'tfidfvectorizer__encoding': 'utf-8',
 'tfidfvectorizer__input': 'content',
 'tfidfvectorizer__lowercase': True,
 'tfidfvectorizer__max_df': 1.0,
 'tfidfvectorizer__max_features': None,
 'tfidfvectorizer__min_df': 1,
 'tfidfvectorizer__ngram_range': (1, 1),
 'tfidfvectorizer__norm': 'l2',
 'tfidfvectorizer__preprocessor': None,
 'tfidfvectorizer__smooth_idf': True,
 'tfidfvectorizer__stop_words': None,
 'tfidfvectorizer__strip_accents': None,
 'tfidfvectorizer__sublinear_tf': False,
 'tfidfvectoriz

In [12]:
?TfidfVectorizer

Init signature:
TfidfVectorizer(
    *,
    input='content',
    encoding='utf-8',
    decode_error='strict',
    strip_accents=None,
    lowercase=True,
    preprocessor=None,
    tokenizer=None,
    analyzer='word',
    stop_words=None,
    token_pattern='(?u)\\b\\w\\w+\\b',

In [13]:
import numpy as np
from sklearn.model_selection import GridSearchCV

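grid = GridSearchCV(
    pipe,
    param_grid={
        # np.logspace takes exponents: 5 values from 10**0.01 (about 1.02) to 10**2
        "logisticregression__C": np.logspace(0.01, 2, 5),
        "truncatedsvd__n_components": [10, 20, 50, 100]
    },
    cv=5,  # 5-fold cross-validation for every parameter combination
)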
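
Since `np.logspace` interprets its first two arguments as base-10 exponents, it can help to see which `C` values this grid contains. The sketch below recomputes them in plain Python, assuming NumPy's defaults (`base=10`, endpoint included); the names `exponents` and `c_grid` are illustrative.

```python
start, stop, num = 0.01, 2, 5

# Evenly spaced exponents between start and stop, endpoints included
step = (stop - start) / (num - 1)
exponents = [start + i * step for i in range(num)]

# Each grid value is 10 raised to one of those exponents
c_grid = [10 ** e for e in exponents]
print([round(c, 2) for c in c_grid])
```

The grid therefore runs from roughly 1.02 up to 100; a grid that also covers stronger regularization (values of `C` below 1) would need a negative starting exponent.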

In [14]:
%%time 

grid.fit(X, y)

CPU times: user 39.8 s, sys: 538 ms, total: 40.4 s
Wall time: 43.6 s
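
To see where that time goes, it helps to count the fits. The following is a back-of-the-envelope sketch, assuming the cache starts empty and ignoring the final refit on the full data that `GridSearchCV` performs by default; the variable names are illustrative.

```python
from itertools import product

n_c_values = 5                      # np.logspace(0.01, 2, 5) yields 5 values of C
svd_components = [10, 20, 50, 100]  # searched values of n_components
n_folds = 5                         # cv=5

# Every (C, n_components) pair is one candidate, evaluated on every fold
candidates = list(product(range(n_c_values), svd_components))
total_fits = len(candidates) * n_folds

# With caching, a transformer is refit only when its own parameters
# or its input data change.
tfidf_fits = n_folds                        # no TF-IDF parameter is searched
svd_fits = len(svd_components) * n_folds    # C does not affect the SVD step
logreg_fits = total_fits                    # the final estimator is never cached

print(total_fits, tfidf_fits, svd_fits, logreg_fits)  # 100 5 20 100
```

Under these assumptions the search runs 100 pipeline fits, but the TF-IDF step only needs to be computed once per fold and the SVD once per fold and `n_components` value.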
