# Optimizing Grid Search

Grid search gets expensive quickly as the search space grows, so it is worth looking at ways to speed it up. We'll demonstrate two of them, parallelization and caching, on the GoEmotions dataset, which consists of 58k Reddit comments annotated with 27 emotion categories.

In [15]:
# Download Data
! wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv

--2024-03-26 10:33:41--  https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv
Resolving proxy-upgrade.intra.glicid.fr (proxy-upgrade.intra.glicid.fr)... 194.167.60.141
Connecting to proxy-upgrade.intra.glicid.fr (proxy-upgrade.intra.glicid.fr)|194.167.60.141|:3128... connected.
Proxy tunneling failed: ForbiddenUnable to establish SSL connection.


In [16]:
import pandas as pd

In [17]:
df = pd.read_csv("/scratch/nautilus/users/jmir@ec-nantes.fr/parallel-python/data/full_dataset/goemotions_1.csv")

In [18]:
df.head()

Unnamed: 0,text,id,author,subreddit,link_id,parent_id,created_utc,rater_id,example_very_unclear,admiration,...,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,That game hurt.,eew5j0j,Brdd9,nrl,t3_ajis4z,t1_eew18eq,1548381000.0,1,False,0,...,0,0,0,0,0,0,0,1,0,0
1,>sexuality shouldn’t be a grouping category I...,eemcysk,TheGreen888,unpopularopinion,t3_ai4q37,t3_ai4q37,1548084000.0,37,True,0,...,0,0,0,0,0,0,0,0,0,0
2,"You do right, if you don't care then fuck 'em!",ed2mah1,Labalool,confessions,t3_abru74,t1_ed2m7g7,1546428000.0,37,False,0,...,0,0,0,0,0,0,0,0,0,1
3,Man I love reddit.,eeibobj,MrsRobertshaw,facepalm,t3_ahulml,t3_ahulml,1547965000.0,18,False,0,...,1,0,0,0,0,0,0,0,0,0
4,"[NAME] was nowhere near them, he was by the Fa...",eda6yn6,American_Fascist713,starwarsspeculation,t3_ackt2f,t1_eda65q2,1546669000.0,2,False,0,...,0,0,0,0,0,0,0,0,0,1


In [19]:
df[['text', 'surprise']].sample(3)

Unnamed: 0,text,surprise
21158,"hi [NAME], toxicity in a multiplayer game is s...",0
68485,Of course I love myself because I'm awesome.,0
43793,The greater good,0


In [20]:
X = df['text']
y = df['surprise']

In [45]:
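from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Text -> TF-IDF features -> low-rank projection -> linear classifier.
# make_pipeline names each step after its lowercased class name
# (e.g. "truncatedsvd"), which is what the grid below refers to.
pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=10),
    LogisticRegression(C=0.1),
)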

In [29]:
#pipe.get_params()

In [30]:
#?TfidfVectorizer

In [46]:
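import numpy as np
from sklearn.model_selection import GridSearchCV

# Baseline: sequential search (n_jobs is left at its default).
grid = GridSearchCV(
    pipe,
    param_grid={
        # logspace takes exponents: C runs from 10**0.01 (about 1.02) to 10**2
        "logisticregression__C": np.logspace(0.01, 2, 5),
        "truncatedsvd__n_components": [10, 20, 50, 100]
    },
    cv=5,
)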

In [33]:
pipe

In [47]:
%%time 

grid.fit(X, y)

CPU times: user 2min 29s, sys: 6.44 s, total: 2min 35s
Wall time: 1min 42s
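
To see where the time goes, it helps to count how many model fits this search performs. The snippet below is a minimal sketch that rebuilds the same grid with the standard library only; the variable names are illustrative and not part of the notebook.

```python
from itertools import product

# np.logspace(0.01, 2, 5) means: 5 exponents evenly spaced between 0.01 and 2,
# each used as a power of 10.
exponents = [0.01 + i * (2 - 0.01) / 4 for i in range(5)]
c_values = [10 ** e for e in exponents]
n_components = [10, 20, 50, 100]
cv_folds = 5

# Every combination of C and n_components is one candidate.
candidates = list(product(c_values, n_components))

# Each candidate is fitted once per cross-validation fold.
n_cv_fits = len(candidates) * cv_folds

# With refit=True (the GridSearchCV default) the best candidate is fitted
# once more on the full dataset.
n_total_fits = n_cv_fits + 1

print(len(candidates), n_cv_fits, n_total_fits)
```

Every one of those fits runs the full pipeline, including TF-IDF and the SVD, which is why the sequential run takes well over a minute.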


In [40]:
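# Speed-up by parallelizing: same grid, spread over 4 worker processes

import numpy as np
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    pipe,
    param_grid={
        "logisticregression__C": np.logspace(0.01, 2, 5),
        "truncatedsvd__n_components": [10, 20, 50, 100]
    },
    cv=5,
    refit=False,  # skip the final fit on all data (so no best_estimator_)
    n_jobs=4      # evaluate candidates/folds in 4 parallel workers
)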

In [41]:
%%time 

grid.fit(X, y)

CPU times: user 943 ms, sys: 81 ms, total: 1.02 s
Wall time: 24.6 s
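
As a rough reading of the two timings, the sketch below computes the speed-up from the wall times printed above. Two caveats apply: the parallel run also sets `refit=False`, so the comparison is not strictly like-for-like, and the very low CPU times most likely reflect that `%%time` only accounts for the notebook process while the work happens in the worker processes. The variable names are illustrative.

```python
# Wall times copied from the two %%time outputs above, in seconds.
baseline_wall = 1 * 60 + 42   # sequential run: 1min 42s
parallel_wall = 24.6          # n_jobs=4, refit=False
n_jobs = 4

speedup = baseline_wall / parallel_wall
efficiency = speedup / n_jobs  # 1.0 would be a perfect split across workers

print(f"speed-up: {speedup:.2f}x, efficiency per worker: {efficiency:.2f}")
```

An efficiency slightly above 1 is not a sign of super-linear scaling here; part of the gain comes from skipping the final refit.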


In [42]:
# Apply caching: fitted transformers are stored on disk and reused

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression


pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=10),
    LogisticRegression(C=0.1),
    memory="cache_demo"  # directory where the fitted transformers are cached
)

In [43]:
# Rebuild the same parallel grid search around the cached pipeline

import numpy as np
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    pipe, 
    param_grid={
        "logisticregression__C": np.logspace(0.01, 2, 5), 
        "truncatedsvd__n_components": [10, 20, 50, 100]
    },
    cv=5,
    refit=False,
    n_jobs=4
)

In [44]:
%%time

grid.fit(X, y)

CPU times: user 914 ms, sys: 55.5 ms, total: 970 ms
Wall time: 23.4 s
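
The gain from caching is small here (24.6 s down to 23.4 s). To get a feel for what the cache can save at best, the sketch below counts how many distinct fits each pipeline step needs if every repeated transformer fit is a cache hit. This is an idealized count, not a measurement: with `n_jobs=4`, workers may fit the same step concurrently before the cache is populated, and hashing the inputs has its own cost. The integer stand-ins for the `C` values are hypothetical placeholders.

```python
from itertools import product

folds = range(5)
c_values = range(5)              # stand-ins for the 5 values of C
n_components = [10, 20, 50, 100]

tfidf_fits, svd_fits, logreg_fits = set(), set(), set()
for fold, c, k in product(folds, c_values, n_components):
    tfidf_fits.add(fold)            # TF-IDF depends only on the training fold
    svd_fits.add((fold, k))         # SVD depends on the fold and n_components
    logreg_fits.add((fold, k, c))   # final estimator: not cached by the pipeline

print(len(tfidf_fits), len(svd_fits), len(logreg_fits))
```

Without caching, each of the three steps is fitted once per candidate and fold. In the ideal case only the final classifier still needs all of those fits, so caching pays off most when the early transformers dominate the runtime.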
