## Dealing with categories: adult.csv

In [73]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

Read the data

In [74]:
data = pd.read_csv("data/adult.csv")
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [75]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


Let's arrange data in X and y

The `income` column takes only two values, `>50K` and `<=50K`, so it is a categorical (binary) target rather than a continuous one:

- `>50K`: income greater than $50,000 per year.

- `<=50K`: income of $50,000 or less per year.

Because the column records which side of the threshold a person falls on, not the amount itself, predicting it is a binary classification problem.
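
As a small illustration of what a binary categorical target looks like, here is a sketch on a toy Series. The five values are made up for the example and are not taken from `adult.csv`; the 0/1 encoding is optional, since `LogisticRegression` also accepts the string labels directly.

```python
import pandas as pd

# Toy stand-in for the "income" column (illustrative values, not from adult.csv)
income = pd.Series(["<=50K", ">50K", "<=50K", "<=50K", ">50K"], name="income")

# Number of samples in each of the two classes
counts = income.value_counts()
print(counts)

# Optional 0/1 encoding: 1 when income is above the threshold, 0 otherwise
income_binary = (income == ">50K").astype(int)
print(income_binary.tolist())  # [0, 1, 0, 0, 1]
```

Checking the class counts first is a cheap way to see how balanced the two classes are before training.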

In [76]:
X = data.drop(columns=["income"])
y = data["income"] #what you want to predict

In [77]:
X.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States


In [78]:
y.head()

0    <=50K
1    <=50K
2     >50K
3     >50K
4    <=50K
Name: income, dtype: object

Let's split data into train and test

In [79]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)

In [80]:
X_train.shape

(34189, 14)

In [81]:
y_train.shape

(34189,)

In [82]:
X_test.shape

(14653, 14)

## Let's preprocess categorical columns

In [83]:
X_train_cat = X_train.select_dtypes("O")

`select_dtypes` is a pandas DataFrame method that selects columns by data type. Here `"O"` stands for `object`, the dtype under which pandas stores the string columns of this dataset (see the `data.info()` output above), and those are its categorical variables.
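
A minimal sketch of `select_dtypes` on a toy DataFrame may help. The column names mirror the dataset, but the rows are made up, and the string columns are cast to `object` explicitly (an assumption of this example) so that it behaves the same whatever string dtype your pandas version uses by default.

```python
import pandas as pd

# Toy frame with two numeric and two string columns (illustrative rows)
toy = pd.DataFrame({
    "age": [25, 38],
    "workclass": ["Private", "Local-gov"],
    "hours-per-week": [40, 50],
    "gender": ["Male", "Female"],
})

# Force object dtype on the string columns so the selection below is unambiguous
toy = toy.astype({"workclass": object, "gender": object})

# "object" is the long form of the "O" dtype code
cat_cols = toy.select_dtypes(include="object").columns.tolist()
num_cols = toy.select_dtypes(include="number").columns.tolist()

print(cat_cols)  # ['workclass', 'gender']
print(num_cols)  # ['age', 'hours-per-week']
```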

In [84]:
X_train_cat.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,gender,native-country
5765,Self-emp-inc,Prof-school,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States
2336,Private,Assoc-voc,Married-civ-spouse,Sales,Husband,White,Male,United-States
22156,Self-emp-not-inc,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States
38574,Self-emp-not-inc,Bachelors,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
43755,Private,Assoc-acdm,Married-civ-spouse,Craft-repair,Husband,White,Male,United-States


In [85]:
ohe = OneHotEncoder(sparse_output=False)

- `OneHotEncoder`: a scikit-learn class that **performs one-hot encoding on categorical variables**, that is, it **converts each categorical variable into a set of binary (0/1) columns suitable for machine learning algorithms.**

- `sparse_output=False`: makes the encoder return a dense NumPy array instead of a SciPy sparse matrix. Each row is an observation (sample) and each column is one category of one categorical variable, holding 1 if the sample belongs to that category and 0 otherwise.

In [86]:
cat_data_ohe = ohe.fit_transform(X_train_cat)

In [87]:
cat_data_ohe

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

In [88]:
cat_data_ohe.shape

(34189, 102)

In [89]:
cat_data_ohe = pd.DataFrame(cat_data_ohe, columns=ohe.get_feature_names_out())

In [90]:
cat_data_ohe.head()

Unnamed: 0,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,education_10th,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [91]:
# concat with the ORIGINAL DataFrame; reset_index(drop=True) aligns its rows with the 0..n-1 index of cat_data_ohe
X_train_full = pd.concat([X_train.reset_index(drop=True), cat_data_ohe], axis=1)

In [92]:
X_train_full.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,59,Self-emp-inc,36085,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,38,Private,189922,Assoc-voc,11,Married-civ-spouse,Sales,Husband,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,41,Self-emp-not-inc,120539,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,45,Self-emp-not-inc,28497,Bachelors,13,Married-civ-spouse,Farming-fishing,Husband,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,30,Private,108386,Assoc-acdm,12,Married-civ-spouse,Craft-repair,Husband,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Remove original categorical columns

In [93]:
X_train_full = X_train_full.drop(columns=X_train_cat.columns) #REMOVE ORIGINAL CATEGORICAL COLUMNS

In [94]:
X_train_full.head()

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,59,36085,15,15024,0,60,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,38,189922,11,0,0,50,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,41,120539,10,3103,0,40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,45,28497,13,0,1485,70,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,30,108386,12,0,0,40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [95]:
X_train_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34189 entries, 0 to 34188
Columns: 108 entries, age to native-country_Yugoslavia
dtypes: float64(102), int64(6)
memory usage: 28.2 MB


Build a `LogisticRegression` model

In [96]:
lr = LogisticRegression()

In [97]:
lr.fit(X_train_full, y_train) #TRAIN WITH FEATURES ADJUSTED

### Apply the same transformation to the test set

In [98]:
X_test_cat = X_test.select_dtypes("O")

In [99]:
X_test_cat.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,gender,native-country
20515,Private,Some-college,Widowed,Exec-managerial,Unmarried,White,Female,United-States
356,Private,Bachelors,Never-married,Sales,Not-in-family,White,Male,United-States
7772,Private,Some-college,Never-married,Adm-clerical,Not-in-family,White,Female,United-States
34450,State-gov,HS-grad,Married-civ-spouse,Adm-clerical,Wife,White,Female,United-States
19643,Private,10th,Never-married,Other-service,Own-child,Black,Female,United-States


In [100]:
X_test_ohe = ohe.transform(X_test_cat)

In [101]:
X_test_ohe = pd.DataFrame(X_test_ohe, columns=ohe.get_feature_names_out())

In [102]:
X_test_ohe.head()

Unnamed: 0,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,education_10th,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [103]:
X_test_full = pd.concat([X_test.reset_index(drop=True), X_test_ohe], axis=1)

In [104]:
X_test_full.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,31,Private,73796,Some-college,10,Widowed,Exec-managerial,Unmarried,White,Female,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,33,Private,90409,Bachelors,13,Never-married,Sales,Not-in-family,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,21,Private,29810,Some-college,10,Never-married,Adm-clerical,Not-in-family,White,Female,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,41,State-gov,176663,HS-grad,9,Married-civ-spouse,Adm-clerical,Wife,White,Female,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,29,Private,136277,10th,6,Never-married,Other-service,Own-child,Black,Female,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [105]:
X_test_full = X_test_full.drop(columns=X_test_cat.columns)

In [106]:
X_test_full.head()

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,31,73796,10,0,0,30,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,33,90409,13,0,0,45,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,21,29810,10,0,0,40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,41,176663,9,0,0,40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,29,136277,6,0,0,32,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


### Predict on the test set with the categorical columns encoded

In [107]:
lr.predict(X_test_full)

array(['<=50K', '<=50K', '<=50K', ..., '<=50K', '>50K', '<=50K'],
      dtype=object)

In [108]:
lr.score(X_test_full, y_test)

0.8004504197092746
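
The manual steps in this section (fit the encoder on train, transform test, concatenate, drop the original columns) can also be bundled into a single object. The sketch below is one possible alternative, not the approach used above: it assumes scikit-learn's `ColumnTransformer` and `Pipeline`, uses a tiny made-up dataset, and sets `handle_unknown="ignore"` (an assumption of this example) so that a category seen only at prediction time does not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative data; the real notebook would pass X_train / X_test instead
train = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 52],
    "workclass": ["Private", "Private", "Local-gov", "Private", "?", "Self-emp-inc"],
})
y_toy = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]
test = pd.DataFrame({"age": [30, 60], "workclass": ["Private", "State-gov"]})

# One-hot encode the categorical column, pass the numeric column through unchanged
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"])],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(train, y_toy)      # the encoder is fitted on the training data only
preds = model.predict(test)  # the same fitted encoder is applied to the test data
print(preds)
```

With this design the encoder can never be refitted on test data by accident, because `predict` only ever calls `transform` on the preprocessing step.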

## EXERCISE

Load the **20 newsgroups** dataset from `scikit-learn` with the code below.

1. Build a classification model (`LogisticRegression`) on the training set
2. Load the "test" set and use your model to `predict` the new texts' categories
3. Calculate the `accuracy` of the model on the test set



In [109]:
import pandas as pd

In [110]:
from sklearn.datasets import fetch_20newsgroups

In [111]:
# first load the dataset from sklearn package
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
text = data["data"]
target = data["target"]
target_names = dict(enumerate(data["target_names"]))

In [112]:
# prepare data in a DataFrame

data = pd.DataFrame({
    "text": text,
    "target": target
})

data.target = data.target.replace(target_names)

The last line replaces the numeric labels in the `target` column with their category names, using the `target_names` dictionary. This makes the dataset easier to read.

In [113]:
data.head()

Unnamed: 0,text,target
0,I was wondering if anyone out there could enli...,rec.autos
1,A fair number of brave souls who upgraded thei...,comp.sys.mac.hardware
2,"well folks, my mac plus finally gave up the gh...",comp.sys.mac.hardware
3,\nDo you have Weitek's address/phone number? ...,comp.graphics
4,"From article <C5owCB.n3p@world.std.com>, by to...",sci.space


In [114]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11314 entries, 0 to 11313
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    11314 non-null  object
 1   target  11314 non-null  object
dtypes: object(2)
memory usage: 176.9+ KB


In [115]:
# to print the text of one particular sample

print(data.iloc[1].text)

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.


## SOLUTION

In [116]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.linear_model import LogisticRegression

In [117]:
X = data[["text"]]
y = data["target"]

In [118]:
X.head()

Unnamed: 0,text
0,I was wondering if anyone out there could enli...
1,A fair number of brave souls who upgraded thei...
2,"well folks, my mac plus finally gave up the gh..."
3,\nDo you have Weitek's address/phone number? ...
4,"From article <C5owCB.n3p@world.std.com>, by to..."


In [119]:
y.head()

0                rec.autos
1    comp.sys.mac.hardware
2    comp.sys.mac.hardware
3            comp.graphics
4                sci.space
Name: target, dtype: object

In [120]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [121]:
print(f"Train: {X_train.shape}")
print(f"Test: {X_test.shape}")

Train: (7919, 1)
Test: (3395, 1)


### **Simple `TfidfVectorizer`**

In [122]:
tfidf = TfidfVectorizer()

Calculate the `X_train` transformed

In [123]:
X_train_tr = tfidf.fit_transform(X_train.text)  # In this case, the "fit_transform" receives a Series with all documents

In [124]:
X_train_tr.toarray().sum()

55381.258475035334

In [125]:
X_train_tr.shape

(7919, 84890)

We see that there are 7919 documents and 84890 columns! (Every column corresponds to a distinct token from the training corpus.)
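
A toy corpus makes the "one column per token" idea easier to inspect. The three documents below are made up for illustration; the sketch also shows that `transform` on new text reuses the fitted vocabulary and silently ignores unseen tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny illustrative documents
docs = ["the mac clock upgrade", "the hockey playoffs", "mac hardware upgrade"]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(docs)   # learns the vocabulary and returns a sparse matrix
vocab = sorted(vec.vocabulary_)   # one entry per column

print(X_toy.shape)  # (3, 7)
print(vocab)        # ['clock', 'hardware', 'hockey', 'mac', 'playoffs', 'the', 'upgrade']

# New text: only "hockey", "the" and "mac" are in the vocabulary; "on" and "laptop" are dropped
X_new = vec.transform(["hockey on the mac laptop"])
print(X_new.shape, X_new.nnz)  # (1, 7) 3
```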

Now, let's train a `LogisticRegression` model

In [126]:
lr = LogisticRegression()

In [127]:
lr.fit(X_train_tr, y_train)  # we can use directly the sparse matrix "X_train_tr" in the LR model

Let's evaluate the results on **train**

In [128]:
lr.score(X_train_tr, y_train)

0.9151408006061371

Let's evaluate the results on **test**

In [129]:
# First, transform test data into numbers with Tf-Idf
X_test_tr = tfidf.transform(X_test.text) # Be careful! Here we use "transform", not "fit_transform"

In [130]:
lr.score(X_test_tr, y_test)

0.708100147275405

### Let's predict with our new model on the test set

In [131]:
lr.predict(X_test_tr)

array(['rec.sport.hockey', 'misc.forsale', 'comp.os.ms-windows.misc', ...,
       'comp.sys.mac.hardware', 'talk.religion.misc',
       'comp.sys.ibm.pc.hardware'], dtype=object)

In [132]:
probas = lr.predict_proba(X_test_tr)

In [133]:
probas[0]

array([0.00314758, 0.00404248, 0.00351152, 0.00325124, 0.00337721,
       0.00479955, 0.00442478, 0.00704534, 0.00494468, 0.02013837,
       0.89555256, 0.00448593, 0.00522128, 0.00557995, 0.00587159,
       0.00568425, 0.00504254, 0.00599117, 0.00387653, 0.00401145])

In [134]:
lr.classes_

array(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
       'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
       'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
       'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt',
       'sci.electronics', 'sci.med', 'sci.space',
       'soc.religion.christian', 'talk.politics.guns',
       'talk.politics.mideast', 'talk.politics.misc',
       'talk.religion.misc'], dtype=object)
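
Each position in a `predict_proba` row lines up with the same position in `lr.classes_`. The sketch below copies the two arrays printed above into standalone variables (`probas_0` and `classes` are names chosen for this example) and reads off the most likely categories.

```python
import numpy as np

# Values copied from the probas[0] output above
probas_0 = np.array([
    0.00314758, 0.00404248, 0.00351152, 0.00325124, 0.00337721,
    0.00479955, 0.00442478, 0.00704534, 0.00494468, 0.02013837,
    0.89555256, 0.00448593, 0.00522128, 0.00557995, 0.00587159,
    0.00568425, 0.00504254, 0.00599117, 0.00387653, 0.00401145,
])

# Values copied from the lr.classes_ output above
classes = np.array([
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware",
    "comp.windows.x", "misc.forsale", "rec.autos", "rec.motorcycles",
    "rec.sport.baseball", "rec.sport.hockey", "sci.crypt",
    "sci.electronics", "sci.med", "sci.space",
    "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc",
    "talk.religion.misc",
])

best = probas_0.argmax()                # index of the highest probability
print(classes[best], probas_0[best])    # rec.sport.hockey 0.89555256

top3 = np.argsort(probas_0)[::-1][:3]   # indices of the three highest probabilities
print(classes[top3].tolist())           # ['rec.sport.hockey', 'rec.sport.baseball', 'rec.autos']
```

The top class matches the first element returned by `lr.predict(X_test_tr)`, which is the category with the highest probability.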

In [135]:
print(X_test.iloc[0].text)

Here is yet another prediction for them great playoffs!
(you may laugh at your convenience!) :)

	Adams Division (I hate the NE (name) divisoin!!!)

BOS vs BUF   BOS in 5  (the B's are hot lately!)

MON vs QUE   MON in 7  (This will be the series to watch in the first round!)


BOS vs MON   MON in 7  (this may be a bit biased but I feel the Canadiens will
		       (smarten up and start playing they played two months ago
			( i.e. bench Savard !!!)
	Patrick Division 

PIT vs NJD   PIT in 6  (It wont be a complete cake walk... there be a few lumps
			(in the cake batter!)

WAS vs NYI   WAS in 6  	(This will not be an exciting series..IMO)


PITT vs WAS  PIT in 4   (Washington will be tired after the NYI)

	Norris Division

CHI vs StL    CHI in 5   (StL will get a lucky game in)

TOR vs DET    TOR in 7   (THis , like MON vs QUE, will be another intense 
			 (series to watch!)

CHI vs TOR    TOR in 7   (Potvin will be settling in nicely by this point.)

	Smythe Division

VAN vs WIN     VAN

### **Complete `TfidfVectorizer`**


```python
TfidfVectorizer(
    *,
    input='content',
    encoding='utf-8',
    decode_error='strict',
    strip_accents=None,
    lowercase=True,                       # automatically transform all text to lowercase
    preprocessor=None,
    tokenizer=None,                       # this controls how tokens (words) are extracted. By default text is split with "token_pattern"
    analyzer='word',
    stop_words=None,                      # this allows us to specify stop words to remove from the vocabulary
    token_pattern='(?u)\\b\\w\\w+\\b',
    ngram_range=(1, 1),                   # this allows us to automatically calculate n-grams
    max_df=1.0,                           # this controls the maximum "document freq." of a word to be included in the vocabulary
    min_df=1,                             # this controls the minimum "document freq." of a word to be included in the vocabulary
    max_features=None,                    # to limit the number of columns we have in the resulting matrix after transformation
    vocabulary=None,                      # to specify directly a vocabulary instead of being extracted from all words in text
    binary=False,
    dtype=<class 'numpy.float64'>,
    norm='l2',
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=False,
)
```
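
To see how a few of these parameters change the vocabulary size, here is a small sketch on a made-up three-sentence corpus. The `n_features` helper is hypothetical and exists only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus
docs = ["the cat sat on the mat", "the cat ate the fish", "a dog sat on the log"]

def n_features(**params):
    """Hypothetical helper: fit a vectorizer with the given params and count its columns."""
    return TfidfVectorizer(**params).fit_transform(docs).shape[1]

n_default = n_features()                                             # every token of 2+ characters
n_stop = n_features(stop_words="english")                            # drops words such as "the" and "on"
n_ngrams = n_features(stop_words="english", ngram_range=(1, 2))      # adds bigrams
n_min_df = n_features(stop_words="english", ngram_range=(1, 2), min_df=2)  # keeps terms found in 2+ documents

print(n_default, n_stop, n_ngrams, n_min_df)

# Inspect which terms survive the min_df filter
kept = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=2).fit(docs)
print(sorted(kept.vocabulary_))  # ['cat', 'sat']
```

The pattern is the same one the next cells show on the real data: removing stop words shrinks the vocabulary slightly, bigrams inflate it, and `min_df` prunes it back by discarding rare terms.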

In [136]:
tfidf = TfidfVectorizer(
    stop_words="english",
)

In [137]:
X_train_tr = tfidf.fit_transform(X_train.text)
X_train_tr.shape

(7919, 84581)

In [138]:
tfidf = TfidfVectorizer(
    stop_words="english",
    ngram_range=(1,2)     # this includes unigrams and bigrams
)

X_train_tr = tfidf.fit_transform(X_train.text)
X_train_tr.shape

(7919, 707253)

In [139]:
tfidf = TfidfVectorizer(
    stop_words="english",
    ngram_range=(1,2),     # this includes unigrams and bigrams
    min_df=5
)

X_train_tr = tfidf.fit_transform(X_train.text)
X_train_tr.shape

(7919, 20226)

In [140]:
lr = LogisticRegression()
lr.fit(X_train_tr, y_train)  # we can use directly the sparse matrix "X_train_tr" in the LR model

In [141]:
lr.score(X_train_tr, y_train)

0.9195605505745675

In [142]:
# First, transform test data into numbers with Tf-Idf
X_test_tr = tfidf.transform(X_test.text) # Be careful! Here we use "transform", not "fit_transform"

In [143]:
lr.score(X_test_tr, y_test)

0.714580265095729

In [144]:
import joblib
joblib.dump(lr, "lr2.pkl")
my_model = joblib.load("lr2.pkl")
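
A saved `LogisticRegression` can only score text that has been transformed by the same fitted `TfidfVectorizer`, so the vectorizer needs to be persisted as well. One way to do that, sketched below on made-up data, is to wrap both steps in a scikit-learn pipeline and dump the pipeline. The file name `text_clf.pkl` and the toy documents are assumptions of this example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set
docs = [
    "mac clock upgrade", "mac floppy drive",
    "hockey playoffs tonight", "hockey series final",
]
labels = [
    "comp.sys.mac.hardware", "comp.sys.mac.hardware",
    "rec.sport.hockey", "rec.sport.hockey",
]

# Vectorizer and classifier travel together in one object
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(docs, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text_clf.pkl")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text directly
new_text = ["who wins the hockey playoffs"]
same = bool((restored.predict(new_text) == pipe.predict(new_text)).all())
print(restored.predict(new_text), same)
```

Alternatively, the fitted `tfidf` object can be dumped to its own file next to the model; the pipeline simply keeps the two from getting out of sync.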