### Problem Definition

You are provided hourly rental data spanning two years. For this competition, the training set comprises the first 19 days of each month, while the test set covers the 20th through the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period. Note that the `bike_rentals.csv` file loaded below holds one row per day (731 rows), so the models in this notebook predict daily rather than hourly counts.

### Features

datetime - hourly date + timestamp  

season - 1 = spring, 2 = summer, 3 = fall, 4 = winter

holiday - whether the day is considered a holiday

workingday - whether the day is neither a weekend nor holiday

weather - 1: Clear, Few clouds, Partly cloudy

2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

temp - temperature in Celsius

atemp - "feels like" temperature in Celsius

humidity - relative humidity

windspeed - wind speed

casual - number of non-registered user rentals initiated

registered - number of registered user rentals initiated

count - number of total rentals

In [1]:
import platform; print(platform.platform())
import sys; print('Python',sys.version)
import numpy; print("Numpy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)
import xgboost; print("XGBoost", xgboost.__version__)

Windows-10-10.0.22000-SP0
Python 3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)]
Numpy 1.21.5
SciPy 1.7.3
Scikit-Learn 1.0.2
XGBoost 1.6.1


In [2]:
import pandas as pd
import numpy as np

In [3]:
df=pd.read_csv('bike_rentals.csv')

In [4]:
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    float64
 3   yr          730 non-null    float64
 4   mnth        730 non-null    float64
 5   holiday     731 non-null    float64
 6   weekday     731 non-null    float64
 7   workingday  731 non-null    float64
 8   weathersit  731 non-null    int64  
 9   temp        730 non-null    float64
 10  atemp       730 non-null    float64
 11  hum         728 non-null    float64
 12  windspeed   726 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(10), int64(5), object(1)
memory usage: 91.5+ KB


In [6]:
df.describe()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,730.0,730.0,731.0,731.0,731.0,731.0,730.0,730.0,728.0,726.0,731.0,731.0,731.0
mean,366.0,2.49658,0.5,6.512329,0.028728,2.997264,0.682627,1.395349,0.495587,0.474512,0.627987,0.190476,848.176471,3656.172367,4504.348837
std,211.165812,1.110807,0.500343,3.448303,0.167155,2.004787,0.465773,0.544894,0.183094,0.163017,0.142331,0.077725,686.622488,1560.256377,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.336875,0.337794,0.521562,0.134494,315.5,2497.0,3152.0
50%,366.0,3.0,0.5,7.0,0.0,3.0,1.0,1.0,0.499166,0.487364,0.627083,0.180971,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,9.75,0.0,5.0,1.0,2.0,0.655625,0.608916,0.730104,0.233218,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0


In [7]:
# A gap between mean and median (e.g. casual: mean 848.18 vs median 713.0) suggests a skewed distribution

In [8]:
df.isnull().sum()

instant       0
dteday        0
season        0
yr            1
mnth          1
holiday       0
weekday       0
workingday    0
weathersit    0
temp          1
atemp         1
hum           3
windspeed     5
casual        0
registered    0
cnt           0
dtype: int64

In [9]:
df[df.isna().any(axis=1)]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
56,57,2011-02-26,1.0,0.0,2.0,0.0,6.0,0.0,1,0.2825,0.282192,0.537917,,424,1545,1969
81,82,2011-03-23,2.0,0.0,3.0,0.0,3.0,1.0,2,0.346957,0.337939,0.839565,,203,1918,2121
128,129,2011-05-09,2.0,0.0,5.0,0.0,1.0,1.0,1,0.5325,0.525246,0.58875,,664,3698,4362
129,130,2011-05-10,2.0,0.0,5.0,0.0,2.0,1.0,1,0.5325,0.522721,,0.115671,694,4109,4803
213,214,2011-08-02,3.0,0.0,8.0,0.0,2.0,1.0,1,0.783333,0.707071,,0.20585,801,4044,4845
298,299,2011-10-26,4.0,0.0,10.0,0.0,3.0,1.0,2,0.484167,0.472846,0.720417,,404,3490,3894
388,389,2012-01-24,1.0,1.0,1.0,0.0,2.0,1.0,1,0.3425,0.349108,,0.123767,439,3900,4339
528,529,2012-06-12,2.0,1.0,6.0,0.0,2.0,1.0,2,0.653333,0.597875,0.833333,,477,4495,4972
701,702,2012-12-02,4.0,1.0,12.0,0.0,0.0,0.0,2,,,0.823333,0.124379,892,3757,4649
730,731,2012-12-31,1.0,,,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


In [10]:
# Assign the result back; inplace=True on a column selection may not update df in newer pandas
df['windspeed'] = df['windspeed'].fillna(df['windspeed'].median())

In [11]:
df.iloc[[56,81,128]]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
56,57,2011-02-26,1.0,0.0,2.0,0.0,6.0,0.0,1,0.2825,0.282192,0.537917,0.180971,424,1545,1969
81,82,2011-03-23,2.0,0.0,3.0,0.0,3.0,1.0,2,0.346957,0.337939,0.839565,0.180971,203,1918,2121
128,129,2011-05-09,2.0,0.0,5.0,0.0,1.0,1.0,1,0.5325,0.525246,0.58875,0.180971,664,3698,4362


In [12]:
df['windspeed'].median()

0.180971
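
The median is used for the fill because it is less affected by extreme values than the mean. Here is a minimal sketch of that idea on a made-up column (the values and the name `wind` are illustrative and do not come from `bike_rentals.csv`):

```python
import numpy as np
import pandas as pd

# Toy column with two gaps and one large value; all numbers are made up.
wind = pd.Series([0.16, np.nan, 0.25, 0.19, np.nan, 0.90], name="windspeed")

median_value = wind.median()   # NaN entries are skipped by default
mean_value = wind.mean()       # pulled upward by the 0.90 outlier

# Assign the filled result to a new name instead of using inplace=True.
wind_filled = wind.fillna(median_value)

print("median:", median_value, "mean:", round(mean_value, 3))
print(wind_filled.tolist())
```

In this toy column the mean sits well above the median, so filling with the mean would insert a value that is larger than most observed readings.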

## Select certain rows to fill missing data

In [13]:
df.iloc[701]

instant              702
dteday        2012-12-02
season               4.0
yr                   1.0
mnth                12.0
holiday              0.0
weekday              0.0
workingday           0.0
weathersit             2
temp                 NaN
atemp                NaN
hum             0.823333
windspeed       0.124379
casual               892
registered          3757
cnt                 4649
Name: 701, dtype: object

In [14]:
# For this row, use the mean of the previous day (row 700) and the next day (row 702)
mean_temp = (df.iloc[700]['temp'] + df.iloc[702]['temp'])/2
mean_atemp = (df.iloc[700]['atemp'] + df.iloc[702]['atemp'])/2

In [15]:
# Each column has a single missing value (row 701), so a column-wide fill is safe here
df['temp'] = df['temp'].fillna(mean_temp)
df['atemp'] = df['atemp'].fillna(mean_atemp)
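
Averaging the day before and the day after is the same as linear interpolation across a single isolated gap. The sketch below shows this on a made-up series (the values and the names `temps` and `gap` are illustrative assumptions). Unlike a column-wide `fillna`, `interpolate` fills each gap from its own neighbours, which matters once a column has more than one missing value.

```python
import numpy as np
import pandas as pd

temps = pd.Series([0.30, 0.34, np.nan, 0.38, 0.40])  # made-up values

# Manual version: average the value before and the value after the gap.
gap = 2
manual = (temps.iloc[gap - 1] + temps.iloc[gap + 1]) / 2

# Linear interpolation fills every gap from its own neighbours.
interpolated = temps.interpolate(method="linear")

print(manual, interpolated.iloc[gap])
```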

In [16]:
df[df.isna().any(axis=1)]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
129,130,2011-05-10,2.0,0.0,5.0,0.0,2.0,1.0,1,0.5325,0.522721,,0.115671,694,4109,4803
213,214,2011-08-02,3.0,0.0,8.0,0.0,2.0,1.0,1,0.783333,0.707071,,0.20585,801,4044,4845
388,389,2012-01-24,1.0,1.0,1.0,0.0,2.0,1.0,1,0.3425,0.349108,,0.123767,439,3900,4339
730,731,2012-12-31,1.0,,,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


In [17]:
df['hum'] = df['hum'].fillna(df['hum'].median())

In [18]:
df[df.isna().any(axis=1)]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
730,731,2012-12-31,1.0,,,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


In [19]:
df['dteday']

0      2011-01-01
1      2011-01-02
2      2011-01-03
3      2011-01-04
4      2011-01-05
          ...    
726    2012-12-27
727    2012-12-28
728    2012-12-29
729    2012-12-30
730    2012-12-31
Name: dteday, Length: 731, dtype: object

[convert object to datetime format](https://pandas.pydata.org/pandas-docs/version/1.0.1/reference/api/pandas.to_datetime.html)


In [20]:
#convert object to datetime format

df['dteday'] = pd.to_datetime(df['dteday'], errors='coerce')

In [21]:
df['dteday']

0     2011-01-01
1     2011-01-02
2     2011-01-03
3     2011-01-04
4     2011-01-05
         ...    
726   2012-12-27
727   2012-12-28
728   2012-12-29
729   2012-12-30
730   2012-12-31
Name: dteday, Length: 731, dtype: datetime64[ns]

In [22]:
import datetime as dt  # optional: the .dt accessor used below comes from pandas, not from this module

In [23]:
df['mnth'] = df['dteday'].dt.month

In [24]:
df['mnth'].iloc[730]

12

In [25]:
df.loc[730,'yr'] = 1.0
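
The missing `yr` could also be derived from the date instead of being typed in by hand. This is a small sketch on three made-up dates, assuming (from the rows shown above) that `yr` is 0 for 2011 and 1 for 2012; the names `dates` and `year_flag` are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2011-01-01", "2012-12-30", "2012-12-31"]))

month = dates.dt.month                           # calendar month, 1 to 12
year_flag = dates.dt.year - dates.dt.year.min()  # 0 for the first year, 1 for the next

print(month.tolist())
print(year_flag.tolist())
```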

In [26]:
df.iloc[730]

instant                       731
dteday        2012-12-31 00:00:00
season                        1.0
yr                            1.0
mnth                           12
holiday                       0.0
weekday                       1.0
workingday                    0.0
weathersit                      2
temp                     0.215833
atemp                    0.223487
hum                        0.5775
windspeed                0.154846
casual                        439
registered                   2290
cnt                          2729
Name: 730, dtype: object

In [27]:
df

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1.0,0.0,1,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1.0,0.0,1,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1.0,0.0,1,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1.0,0.0,1,0.0,2.0,1.0,1,0.200000,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1.0,0.0,1,0.0,3.0,1.0,1,0.226957,0.229270,0.436957,0.186900,82,1518,1600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,727,2012-12-27,1.0,1.0,12,0.0,4.0,1.0,2,0.254167,0.226642,0.652917,0.350133,247,1867,2114
727,728,2012-12-28,1.0,1.0,12,0.0,5.0,1.0,2,0.253333,0.255046,0.590000,0.155471,644,2451,3095
728,729,2012-12-29,1.0,1.0,12,0.0,6.0,0.0,2,0.253333,0.242400,0.752917,0.124383,159,1182,1341
729,730,2012-12-30,1.0,1.0,12,0.0,0.0,0.0,1,0.255833,0.231700,0.483333,0.350754,364,1432,1796


In [28]:
df=df.drop('dteday',axis=1)

In [29]:
df.head(10)

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,1.0,0.0,1,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,1.0,0.0,1,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,1.0,0.0,1,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,1.0,0.0,1,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,1.0,0.0,1,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600
5,6,1.0,0.0,1,0.0,4.0,1.0,1,0.204348,0.233209,0.518261,0.089565,88,1518,1606
6,7,1.0,0.0,1,0.0,5.0,1.0,2,0.196522,0.208839,0.498696,0.168726,148,1362,1510
7,8,1.0,0.0,1,0.0,6.0,0.0,2,0.165,0.162254,0.535833,0.266804,68,891,959
8,9,1.0,0.0,1,0.0,0.0,0.0,1,0.138333,0.116175,0.434167,0.36195,54,768,822
9,10,1.0,0.0,1,0.0,1.0,1.0,1,0.150833,0.150888,0.482917,0.223267,41,1280,1321


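#### Since cnt is equal to the sum of casual and registered, only one of the three should be kept as the target.

The next cell drops casual and cnt, so registered, the last remaining column, is the target used by the models below. To predict total rentals instead, casual and registered would be dropped and cnt kept.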

In [30]:
df = df.drop(['casual','cnt'], axis=1)

In [31]:
df.head(5)

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,registered
0,1,1.0,0.0,1,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,654
1,2,1.0,0.0,1,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,670
2,3,1.0,0.0,1,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,1229
3,4,1.0,0.0,1,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,1454
4,5,1.0,0.0,1,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,1518


In [32]:
df.to_csv('cleaned data', index=False)

In [33]:
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

### Now I will split the dataset into a training set and a test set

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)

[Why 42?](https://grsahagian.medium.com/what-is-random-state-42-d803402ee76b#:~:text=The%20number%2042%20is%20sort,over%20the%20period%20of%207.5)
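
With no `test_size` argument, `train_test_split` holds out 25% of the rows, and a fixed `random_state` makes the split repeatable. Here is a sketch on placeholder arrays with the same number of rows as the bike data (`X_demo` and `y_demo` are stand-ins, not the real features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 731 rows, matching the row count of the bike dataset.
X_demo = np.arange(731 * 2).reshape(731, 2)
y_demo = np.arange(731)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)

# The same seed reproduces the same split.
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, random_state=42)
print(np.array_equal(X_te, X_te_again))
```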

In [36]:
import warnings
warnings.filterwarnings("ignore")

In [37]:
#Linear Regression object

lin_reg=LinearRegression()

In [38]:
#fit data into the model

lin_reg.fit(X_train,y_train)

LinearRegression()

In [39]:
# Predict on X_test with the fitted model

y_pred=lin_reg.predict(X_test)

In [40]:
from sklearn.metrics import mean_squared_error
import numpy as np
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

In [41]:
print("RMSE: %0.02f" %(rmse))

RMSE: 674.50
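
RMSE is the square root of the mean squared difference between predictions and true values, so it is expressed in the same units as the target (rentals). This is a worked sketch on three made-up values that checks the manual formula against `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 200.0, 300.0])   # made-up targets
y_hat = np.array([110.0, 190.0, 330.0])    # made-up predictions

errors = y_hat - y_true                       # [10, -10, 30]
rmse_manual = np.sqrt(np.mean(errors ** 2))   # sqrt((100 + 100 + 900) / 3)
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(round(rmse_manual, 2), round(rmse_sklearn, 2))
```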


### Let's try XGBoost

In [42]:
from xgboost import XGBRegressor

In [43]:
xg_reg = XGBRegressor()
xg_reg.fit(X_train, y_train)
y_pred = xg_reg.predict(X_test)

In [44]:
mse = mean_squared_error(y_test,y_pred)
rmse = np.sqrt(mse)
print("RMSE: %0.02f" %(rmse))

RMSE: 524.76


### Cross Validation

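With 5-fold cross-validation, each fold holds out 20% of the data as the test set and trains on the remaining 80%.
Increasing the number of folds lets each model train on more of the data, so the averaged score depends less on any single split. The runs below use 10 folds, so each fold tests on about 10% of the rows.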
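
The fold sizes can be checked directly with `KFold`. The sketch below uses a placeholder array with 731 rows (the row count of the bike data); the names `X_demo` and `fold_sizes` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 731
X_demo = np.zeros((n_rows, 1))   # placeholder features; only the row count matters

fold_sizes = {}
for k in (5, 10):
    # Each split yields (train indices, test indices); collect the test-set sizes.
    fold_sizes[k] = [len(test_idx) for _, test_idx in KFold(n_splits=k).split(X_demo)]
    print(k, "folds:", fold_sizes[k], "test share:", round(fold_sizes[k][0] / n_rows, 2))
```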

In [45]:
from sklearn.model_selection import cross_val_score

In [46]:
model = LinearRegression()

In [47]:
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=10)

In [48]:
rmse = np.sqrt(-scores)

In [49]:
print("RMSE: ", np.round(rmse,2))
print("RMSE Avg: %0.02f" % (rmse.mean()))

RMSE:  [ 458.92  663.04  749.73  537.13  579.44  565.63  670.05  924.05  870.94
 1252.47]
RMSE Avg: 727.14


### Cross Validation with XGBoost

In [50]:
model = XGBRegressor()

In [51]:
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=10)
rmse = np.sqrt(-scores)
print("RMSE: ", np.round(rmse,2))
print("RMSE Avg: %0.02f" % (rmse.mean()))

RMSE:  [ 391.34  460.14  341.21  503.21  788.31 1028.42  660.13  560.91  577.51
 1485.85]
RMSE Avg: 679.70
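
In both runs the last fold has the highest RMSE (1252.47 and 1485.85). One possible reason, which is an assumption and not something verified here, is that the rows are in date order and an integer `cv` gives a regressor contiguous, unshuffled folds. Here is a sketch of passing a shuffled `KFold` instead, on synthetic data (`X_demo`, `y_demo` and the seeds are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = np.arange(200, dtype=float).reshape(-1, 1)   # stands in for date-ordered rows
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=5.0, size=200)

# shuffle=True mixes early and late rows into every fold.
shuffled = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         scoring="neg_mean_squared_error", cv=shuffled)
rmse_folds = np.sqrt(-scores)

print(np.round(rmse_folds, 2))
print("RMSE Avg: %0.2f" % rmse_folds.mean())
```

Whether shuffling is appropriate depends on the goal: if the model must forecast future days, contiguous folds are the more honest test.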


### How About Logistic Regression?

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

Prediction task is to determine whether a person makes over 50K a year.

In [52]:
df_census = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data")

Attribute Information:

Listing of attributes:

income (target): >50K, <=50K.

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.



In [53]:
df_census.head(10)

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
5,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
6,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
7,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
8,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K
9,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K


In [54]:
data=[]
for i in df_census.columns:
    print("This has been added to list",i)
    data.append(i)
data

This has been added to list 39
This has been added to list  State-gov
This has been added to list  77516
This has been added to list  Bachelors
This has been added to list  13
This has been added to list  Never-married
This has been added to list  Adm-clerical
This has been added to list  Not-in-family
This has been added to list  White
This has been added to list  Male
This has been added to list  2174
This has been added to list  0
This has been added to list  40
This has been added to list  United-States
This has been added to list  <=50K


['39',
 ' State-gov',
 ' 77516',
 ' Bachelors',
 ' 13',
 ' Never-married',
 ' Adm-clerical',
 ' Not-in-family',
 ' White',
 ' Male',
 ' 2174',
 ' 0',
 ' 40',
 ' United-States',
 ' <=50K']

In [55]:
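# Caution: this overwrites the first parsed record (age 50, Self-emp-not-inc) with the
# header values, so that record is lost; the header strings also keep their leading spaces
df_census.iloc[0]=data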
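
An alternative to copying the header values back into row 0 is to tell `read_csv` that the file has no header row. The sketch below runs on two records taken from the output above and held in a string, so it needs no download. The column names follow the attribute listing, and `skipinitialspace=True` strips the space after each comma. The names `raw`, `names` and `demo` are illustrative.

```python
import io
import pandas as pd

# Two records from the output above, in the raw "value, value" layout of the file.
raw = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, "
    "Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

names = ["age", "workclass", "fnlwgt", "education", "education-num",
         "marital-status", "occupation", "relationship", "race", "sex",
         "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

# header=None keeps the first line as data; names supplies the column labels.
demo = pd.read_csv(io.StringIO(raw), header=None, names=names, skipinitialspace=True)

print(demo.shape)
print(demo.dtypes["age"], repr(demo.loc[0, "workclass"]))
```

Read this way, both records are kept and the numeric columns are parsed as integers without a separate `astype` step.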

In [56]:
df_census

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32555,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32556,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32557,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32558,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


### We need to add column names to the df

In [57]:
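# Note: in the raw file the sixth field is marital-status and the seventh is occupation,
# so those two labels are swapped here; 'education-nun' stands for education-num
df_census.columns=['Age', 'workclass', 'fnlwgt', 'education', 'education-nun','occupation', 'marital-status', 'relationship', 'race','sex', 'capital-gain', 'capital-loss', 'hours-per-week','native-country', 'income']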
df_census.head(10)

Unnamed: 0,Age,workclass,fnlwgt,education,education-nun,occupation,marital-status,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
5,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
6,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
7,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
8,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K
9,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K


In [58]:
df_census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Age             32560 non-null  object
 1   workclass       32560 non-null  object
 2   fnlwgt          32560 non-null  object
 3   education       32560 non-null  object
 4   education-nun   32560 non-null  object
 5   occupation      32560 non-null  object
 6   marital-status  32560 non-null  object
 7   relationship    32560 non-null  object
 8   race            32560 non-null  object
 9   sex             32560 non-null  object
 10  capital-gain    32560 non-null  object
 11  capital-loss    32560 non-null  object
 12  hours-per-week  32560 non-null  object
 13  native-country  32560 non-null  object
 14  income          32560 non-null  object
dtypes: object(15)
memory usage: 3.7+ MB


In [59]:
# Convert the numeric columns from object to int

df_census['Age'] =  df_census['Age'].astype(int)
df_census['education-nun'] =  df_census['education-nun'].astype(int)
df_census['fnlwgt'] =  df_census['fnlwgt'].astype(int)
df_census['capital-gain'] =  df_census['capital-gain'].astype(int)
df_census['capital-loss'] =  df_census['capital-loss'].astype(int)
df_census['hours-per-week'] =  df_census['hours-per-week'].astype(int)
df_census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Age             32560 non-null  int32 
 1   workclass       32560 non-null  object
 2   fnlwgt          32560 non-null  int32 
 3   education       32560 non-null  object
 4   education-nun   32560 non-null  int32 
 5   occupation      32560 non-null  object
 6   marital-status  32560 non-null  object
 7   relationship    32560 non-null  object
 8   race            32560 non-null  object
 9   sex             32560 non-null  object
 10  capital-gain    32560 non-null  int32 
 11  capital-loss    32560 non-null  int32 
 12  hours-per-week  32560 non-null  int32 
 13  native-country  32560 non-null  object
 14  income          32560 non-null  object
dtypes: int32(6), object(9)
memory usage: 3.0+ MB


This dataset still contains object (string) columns. There are many ways to handle them; here we convert them to numeric values.
pd.get_dummies() replaces each object column with one 0/1 indicator column per category.
Let's drop the education column first, since it carries the same information as the numeric education-num column (named `education-nun` in this DataFrame).
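
Here is a small sketch of what `pd.get_dummies()` does, on a made-up three-row frame (the name `toy` and its values are illustrative). Passing `dtype=int` is a choice made here to get 0/1 integers, because newer pandas versions return booleans by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [39, 38, 52],
    "sex": ["Male", "Male", "Female"],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Numeric columns pass through; each object column becomes one column per category.
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded["income_>50K"].tolist())
```

The two `income_` columns always sum to 1, so one of them is redundant and the other can serve as the binary target.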

In [60]:
df_census = df_census.drop(['education'],axis=1)

In [61]:
df_census = pd.get_dummies(df_census)
df_census.head(10)

Unnamed: 0,Age,fnlwgt,education-nun,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,income_ <=50K,income_ >50K
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
1,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
2,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
3,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,37,284582,14,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
5,49,160187,5,0,0,16,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
6,52,209642,9,0,0,45,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
7,31,45781,14,14084,0,50,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
8,42,159449,13,5178,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
9,37,280464,10,0,0,80,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1


In [62]:
df_census=df_census.drop(['income_ <=50K'],axis=1)
df_census.head(10)

Unnamed: 0,Age,fnlwgt,education-nun,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,income_ >50K
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,37,284582,14,0,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
5,49,160187,5,0,0,16,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,52,209642,9,0,0,45,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
7,31,45781,14,14084,0,50,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
8,42,159449,13,5178,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
9,37,280464,10,0,0,80,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1


In [63]:
X=df_census.iloc[:,:-1]
y=df_census.iloc[:,-1]

### Logistic modelling

In [64]:
from sklearn.linear_model import LogisticRegression

In [65]:
model = LogisticRegression()

In [66]:
#create function for cross_val

def cross_val(classifier, num_splits=10):
    model = classifier
    scores = cross_val_score(model, X, y, cv=num_splits)
    print("Accuracy: ", np.round(scores,2))
    print("Average Accuracy: %0.02f "% (scores.mean()))

In [67]:
cross_val(LogisticRegression())

Accuracy:  [0.8  0.8  0.79 0.8  0.79 0.81 0.79 0.8  0.8  0.8 ]
Average Accuracy: 0.80 
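
The helper above reads `X` and `y` from the notebook's global scope. Below is a variant that takes them as arguments and returns the scores, shown on synthetic data. The name `cross_val_report`, the `make_classification` data and `max_iter=1000` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cross_val_report(classifier, X, y, num_splits=10):
    """Print and return per-fold accuracy (the default score for classifiers)."""
    scores = cross_val_score(classifier, X, y, cv=num_splits)
    print("Accuracy: ", np.round(scores, 2))
    print("Average Accuracy: %0.2f" % scores.mean())
    return scores

# Synthetic binary classification data, used only to exercise the helper.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_scores = cross_val_report(LogisticRegression(max_iter=1000), X_demo, y_demo)
```

Passing the data explicitly makes it harder to score a model against a stale `X` and `y` left over from an earlier section, such as the bike data.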


In [68]:
from xgboost import XGBClassifier

In [69]:
cross_val(XGBClassifier(n_estimators=5))

Accuracy:  [0.85 0.86 0.87 0.85 0.86 0.86 0.86 0.87 0.86 0.86]
Average Accuracy: 0.86 


In [71]:
df_census.to_csv("Census_cleaned")