# SciKit-Learn Logistic Regression Vignette

December 28, 2022


@author: Oscar A. Trevizo

### References
1. SciKit-Learn Logistic Regression documentation (accessed Dec. 28, 2022) 
   https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

### Methods (see reference)
- fit(X, y[, sample_weight]) Fit the model according to the given training data.
- get_params([deep]) Get parameters for this estimator.
- predict(X) Predict class labels for samples in X.
- score(X, y[, sample_weight]) Return the mean accuracy on the given test data and labels.
- set_params(**params) Set the parameters of this estimator.
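
As a quick orientation, the methods listed above can be exercised on a small synthetic dataset. This is a minimal sketch: `X_demo`, `y_demo`, and `clf` are hypothetical names, and the data comes from `make_classification`, not from the dataset used later in this vignette.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression()
clf.fit(X_demo, y_demo)                # fit: estimate coefficients from the data
labels = clf.predict(X_demo[:5])       # predict: one class label per row
accuracy = clf.score(X_demo, y_demo)   # score: mean accuracy, a value in [0, 1]
params = clf.get_params()              # get_params: dict of hyperparameters
clf.set_params(C=0.5)                  # set_params: update a hyperparameter

print(labels, accuracy, params["C"], clf.get_params()["C"])
```

Note that `set_params` only changes the hyperparameter; the model would need to be fit again for the new value of `C` to take effect.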

# Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report as report

# Load the data

In [2]:
# See my use case 'world_migration_create_time_series.ipynb' under my GitHub 'otrevizo'
df = pd.read_csv("../data/normalized_wpp_wb.csv")
df.shape

(5978, 21)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5978 entries, 0 to 5977
Data columns (total 21 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Unnamed: 0                        5978 non-null   int64  
 1   Index                             5978 non-null   int64  
 2   Location                          5978 non-null   object 
 3   ISO3                              5978 non-null   object 
 4   ISO2                              5978 non-null   object 
 5   LocType                           5978 non-null   object 
 6   Year                              5978 non-null   int64  
 7   N_Population_Ks                   5978 non-null   float64
 8   N_MedAge                          5978 non-null   float64
 9   N_PopulationGrowthRate            5978 non-null   float64
 10  N_FertilityRate_births_per_woman  5978 non-null   float64
 11  N_LifeExpectancy                  5978 non-null   float64
 12  Immigr

# Prepare x and y

In [4]:
# For x, optionally build a DataFrame that holds only the predictor variables
# by dropping the identifier columns and the other non-predictor columns
drop_cols = ['Unnamed: 0', 'Index', 'Location', 'ISO3', 'ISO2', 'LocType', 'Year',
             'ImmigrantsEmigrants', 'ReceivesMigrants']
df_x = df.drop(columns=drop_cols)
df_x.head()

| | N_Population_Ks | N_MedAge | N_PopulationGrowthRate | N_FertilityRate_births_per_woman | N_LifeExpectancy | N_GDP_USD | N_logGDP | N_GDP_growth_pct | N_GDP_PCAP_USD | N_Inflation_pct | N_NetMigrants_Ks | N_NetMigrationRate_per_Kpop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001792 | 0.008369 | 0.016489 | 0.021848 | 0.007798 | 0.000729 | 0.013166 | 0.0 | 0.003157 | 0.0 | 0.000124 | 0.000206 |
| 1 | 0.001837 | 0.008376 | 0.017061 | 0.021839 | 0.007927 | 0.000737 | 0.013172 | 0.007302 | 0.003161 | 0.000139 | 0.000508 | 0.000821 |
| 2 | 0.001884 | 0.008385 | 0.017381 | 0.021832 | 0.008006 | 0.000774 | 0.013198 | 0.011254 | 0.003254 | 0.000178 | 0.000656 | 0.001034 |
| 3 | 0.001933 | 0.008393 | 0.017742 | 0.021823 | 0.008113 | 0.000835 | 0.013237 | 0.010959 | 0.003349 | 0.000258 | 0.000689 | 0.001061 |
| 4 | 0.001984 | 0.008401 | 0.018083 | 0.021807 | 0.008174 | 0.000917 | 0.013287 | 0.01404 | 0.003521 | 0.00026 | 0.000786 | 0.001177 |


In [5]:
# Prepare x and y

# The use case notebook includes several pair plots that suggest a logistic relationship
# between the sign of net migration (positive vs. negative) and the other variables

# NOTE: x holds a single predictor, as an alternative to using the entire df_x
x = df[['N_MedAge']]
y = df.ImmigrantsEmigrants

# Build train test datasets

In [6]:
# Using one variable x
# x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=40)

# Using all predictors in df_x
x_train, x_test, y_train, y_test = train_test_split(df_x, y, test_size=0.2, random_state=40)

In [7]:
x_train.shape

(4782, 12)

In [8]:
y_train.shape

(4782,)

In [9]:
x_test.shape

(1196, 12)

In [10]:
y_test.shape

(1196,)
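
The four shapes above follow from how `train_test_split` sizes the split: with a float `test_size`, the test set gets the ceiling of `test_size * n_samples` and the training set gets the remainder. The sketch below reproduces the arithmetic and checks it against `train_test_split` on a placeholder index array (`np.arange`, an assumption standing in for the real rows).

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 5978   # number of rows reported by df.shape
test_size = 0.2

# A float test_size is rounded up to a whole number of test rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Cross-check on a placeholder array with the same number of rows
idx_train, idx_test = train_test_split(
    np.arange(n_samples), test_size=test_size, random_state=40
)
print(len(idx_train), len(idx_test))
```

If the class proportions should be preserved in both subsets, `train_test_split` also accepts a `stratify` argument; this vignette does not use it.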

# Fit the model

In [11]:
# from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(x_train, y_train)

LogisticRegression()
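
After fitting, it can help to look at what the model learned. The following is a sketch on synthetic stand-in data (`X_demo`, `y_demo`, and `demo_model` are hypothetical names; the string labels only mimic this vignette's target values). It inspects `classes_`, `coef_`, and `intercept_`, and shows that for a two-class problem `predict_proba` matches the logistic (sigmoid) function applied to the linear score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the migration dataset
X_demo, y_num = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=1
)
y_demo = np.where(y_num == 1, "Immigrants", "Emigrants")

demo_model = LogisticRegression().fit(X_demo, y_demo)

# classes_ is sorted; the columns of predict_proba follow the same order
print(demo_model.classes_)
print(demo_model.coef_.shape, demo_model.intercept_.shape)

proba = demo_model.predict_proba(X_demo[:3])

# Recompute P(second class) by hand: sigmoid of the linear score
linear_score = X_demo[:3] @ demo_model.coef_.ravel() + demo_model.intercept_[0]
manual_p = 1.0 / (1.0 + np.exp(-linear_score))
print(np.allclose(proba[:, 1], manual_p))
```

By default, `predict` returns the class with the higher probability, which for two classes corresponds to a 0.5 threshold.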

# Predict

In [12]:
y_predict = logmodel.predict(x_test)

In [13]:
print(y_test)

1389     Emigrants
2195     Emigrants
2099    Immigrants
3042    Immigrants
5       Immigrants
           ...    
734      Emigrants
413      Emigrants
3995    Immigrants
3006    Immigrants
1592    Immigrants
Name: ImmigrantsEmigrants, Length: 1196, dtype: object


In [14]:
logmodel.score(x_test, y_test)

0.6998327759197325
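
For a classifier, `score` is the mean accuracy: the fraction of test rows whose predicted label equals the true label. The sketch below illustrates this on synthetic data (all `demo` names are hypothetical) by comparing `score` with `accuracy_score`, and then converts the score reported above into a count of correct predictions out of the 1196 test rows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=40
)

demo_model = LogisticRegression().fit(X_tr, y_tr)

score_value = demo_model.score(X_te, y_te)
accuracy_value = accuracy_score(y_te, demo_model.predict(X_te))
print(score_value == accuracy_value)

# The vignette's score of 0.6998... on 1196 test rows, as a count of correct predictions
n_correct = round(0.6998327759197325 * 1196)
print(n_correct)
```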

In [15]:
# from sklearn.metrics import classification_report as report

print(report(y_test, y_predict))

              precision    recall  f1-score   support

   Emigrants       0.66      1.00      0.79       685
  Immigrants       1.00      0.30      0.46       511

    accuracy                           0.70      1196
   macro avg       0.83      0.65      0.63      1196
weighted avg       0.80      0.70      0.65      1196
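
One way to put the 0.70 accuracy in context is to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent class. The sketch below uses only the support values from the report above; the variable names are hypothetical.

```python
# Support values copied from the classification report above
support = {"Emigrants": 685, "Immigrants": 511}

n_test = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = support[majority_class] / n_test
model_accuracy = 0.6998327759197325   # value returned by logmodel.score above

print(majority_class, round(baseline_accuracy, 4), round(model_accuracy, 4))
```

The report also shows why accuracy alone can hide problems: recall is 1.00 for Emigrants but only 0.30 for Immigrants, so most Immigrants rows are being labeled as Emigrants.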



Precision: Fraction of positive predictions that are correct, TP / (TP + FP).

Recall: Fraction of actual positives that were correctly identified, TP / (TP + FN).

A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the true labels.

A system with high precision but low recall is just the opposite: it returns very few results, but most of its predicted labels are correct when compared to the true labels.

An ideal system with high precision and high recall will return many results, with all results labeled correctly.

F1 score: Harmonic mean of precision and recall, 2 * precision * recall / (precision + recall).

Support: Number of actual occurrences of the class in the specified dataset (here, the test set).
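
To make these metrics concrete, the sketch below computes them by hand from a confusion matrix and compares the results with scikit-learn's own functions. The F1 score is taken as the harmonic mean of precision and recall. The ten labels are made up for illustration (they are not the vignette's predictions), and "Immigrants" is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels (an assumption for illustration only)
y_true = ["Emigrants"] * 6 + ["Immigrants"] * 4
y_pred = ["Emigrants"] * 6 + ["Emigrants", "Emigrants", "Immigrants", "Immigrants"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=["Emigrants", "Immigrants"])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # correct positive predictions / all positive predictions
recall = tp / (tp + fn)      # correct positive predictions / all actual positives
f1 = 2 * precision * recall / (precision + recall)
support = tp + fn            # actual occurrences of the positive class

print(precision, recall, f1, support)

# Cross-check against scikit-learn
sk_precision = precision_score(y_true, y_pred, pos_label="Immigrants")
sk_recall = recall_score(y_true, y_pred, pos_label="Immigrants")
sk_f1 = f1_score(y_true, y_pred, pos_label="Immigrants")
```

This toy case has the same shape as the report above: every "Immigrants" prediction is correct (high precision), but half of the actual Immigrants rows are missed (low recall), and the F1 score falls between the two.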