# Preprocessing and Training


Data preprocessing is an integral step in machine learning: the quality of the data, and of the information that can be derived from it, directly affects how well a model can learn. The data should therefore be preprocessed before it is fed into a model.

In this portion of the project I will be preprocessing the data to make it usable for modeling, and fitting a simple baseline regression along the way. Process:

- Create dummy features

- Standardize the scale of numeric features

- Split data into training and testing subsets
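The three steps above can be sketched end to end on a tiny, made-up table. The `toy` frame and its values are hypothetical stand-ins, not rows from the NBA dataset. The split is done before scaling here so that the scaler is fit on the training rows only; this is a minimal sketch of the idea, not the notebook's exact pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the NBA table (values are made up).
toy = pd.DataFrame({
    "Team": ["Hawks", "Wizards", "Hawks", "Celtics", "Wizards", "Celtics"],
    "Pace": [97.1, 93.8, 96.5, 95.0, 94.2, 98.3],
    "W/L%": [58.5, 28.0, 55.0, 61.0, 30.5, 64.0],
})

# 1. Dummy features: one indicator column per team.
features = pd.get_dummies(toy[["Team", "Pace"]], columns=["Team"], dtype=float)
target = toy["W/L%"]

# 2. Split first, so the scaler never sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=0
)

# 3. Standardize: fit on the training rows, reuse the same scaler on the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```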

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
import scipy

In [None]:
sns.set_style(style = 'whitegrid')

In [None]:
nba = pd.read_excel(r"C:\Users\ptlon\OneDrive\Desktop\nba_1.xlsx")

In [None]:
nba

In [None]:

Unnamed: 0	Season	Team	W	L	W/L%	Finish	SRS	Pace	Rel_Pace	...	ORtg_y	DRtg_y	OWS	DWS	WS	WS/48	OBPM	DBPM	BPM	VORP
0	0	2015	Atlanta Hawks	48	34	58.5	2	3.49	97.1	1.3	...	140.0	110.0	0.0	0.0	0.0	0.291	6.3	-10.5	-4.1	0.0
1	1	2015	Atlanta Hawks	48	34	58.5	2	3.49	97.1	1.3	...	204.0	112.0	0.0	0.0	0.0	0.343	9.7	-6.6	3.2	0.0
2	2	2015	Atlanta Hawks	48	34	58.5	2	3.49	97.1	1.3	...	125.0	103.0	13.8	4.1	17.9	0.318	12.4	0.1	12.5	9.8
3	3	2015	Atlanta Hawks	48	34	58.5	2	3.49	97.1	1.3	...	122.0	104.0	11.0	3.5	14.5	0.270	7.0	0.9	7.9	6.4
4	4	2015	Atlanta Hawks	48	34	58.5	2	3.49	97.1	1.3	...	130.0	96.0	2.3	1.2	3.4	0.325	2.7	0.9	3.6	0.7
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
85465	85465	2010	Washington Wizards	23	59	28.0	5	-7.30	93.8	1.7	...	110.0	111.0	3.9	1.8	5.8	0.088	1.5	-0.4	1.1	2.5
85466	85466	2010	Washington Wizards	23	59	28.0	5	-7.30	93.8	1.7	...	101.0	111.0	0.2	0.4	0.6	0.038	-1.6	1.2	-0.4	0.3
85467	85467	2010	Washington Wizards	23	59	28.0	5	-7.30	93.8	1.7	...	107.0	114.0	2.3	0.4	2.8	0.065	0.5	-3.5	-3.0	-0.5
85468	85468	2010	Washington Wizards	23	59	28.0	5	-7.30	93.8	1.7	...	106.0	106.0	1.1	1.8	2.9	0.088	-1.6	0.3	-1.3	0.3
85469	85469	2010	Washington Wizards	23	59	28.0	5	-7.30	93.8	1.7	...	110.0	104.0	3.2	3.0	6.2	0.139	0.3	0.0	0.3	1.2
85470 rows × 45 columns



In [None]:
nrow, ncol = nba.shape
nrow, ncol

In [None]:
nba.info()

In [None]:
df = nba[['W/L%', 'Pace']]

In [None]:
df.head()

In [None]:
x = df.iloc[:, 0:1].values
y = df.iloc[:, -1].values

In [None]:
x

In [None]:
y

Divide the complete dataset into training and testing data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

Fit a simple linear regression model (a regressor, not a classifier) that predicts Pace from W/L%

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
predictions = model.predict(X_test)
predictions
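Test-set predictions can also be summarized with numeric scores. The sketch below uses synthetic data (an assumed linear relationship with noise, not the NBA table) to show how mean squared error and R² could be computed with `sklearn.metrics` for a model fit the same way as above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the NBA table): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(20, 80, size=(200, 1))
y = 90 + 0.05 * x[:, 0] + rng.normal(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Lower MSE is better; R^2 close to 1 means most of the variance is explained.
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"slope={model.coef_[0]:.3f}, intercept={model.intercept_:.2f}")
print(f"test MSE={mse:.3f}, test R^2={r2:.3f}")
```

An R² near zero on the real data would suggest that W/L% alone explains little of the variation in Pace.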

In [None]:
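# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(predictions - y_test, kde=True)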

In [None]:
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, model.predict(X_train), color = 'blue')
plt.title('W/L% vs Pace (Training set)')
plt.xlabel('W/L%')
plt.ylabel('Pace')
plt.show()

In [None]:
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_test, model.predict(X_test), color = 'blue')
plt.title('W/L% vs Pace (Test set)')
plt.xlabel('W/L%')
plt.ylabel('Pace')
plt.show()

Standardize the magnitude of numeric features using a scaler

In [None]:
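from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()
# Fit the scalers on the training data only, so no test-set statistics leak into training
Xs = sc_X.fit_transform(X_train)
Ys = np.squeeze(sc_Y.fit_transform(y_train.reshape(-1, 1)))
# Reuse the fitted scalers to transform the test data
Xs_test = sc_X.transform(X_test)
Ys_test = np.squeeze(sc_Y.transform(y_test.reshape(-1, 1)))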
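When the target is standardized as well, predictions come out in standardized units and need to be mapped back before they can be read as Pace values. The sketch below, on synthetic stand-in data, shows one way to do that with `inverse_transform`; the names `sc_y`, `scaled_model`, and `X_new` are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in values (not from the NBA table).
rng = np.random.default_rng(1)
X_train = rng.uniform(20, 80, size=(50, 1))
y_train = 90 + 0.05 * X_train[:, 0] + rng.normal(0, 1, size=50)
X_new = np.array([[30.0], [60.0]])

# Fit both scalers on the training data only.
sc_X, sc_y = StandardScaler(), StandardScaler()
X_train_s = sc_X.fit_transform(X_train)
y_train_s = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()

scaled_model = LinearRegression().fit(X_train_s, y_train_s)

# Predict in standardized units, then undo the target scaling.
pred_s = scaled_model.predict(sc_X.transform(X_new))
pred = sc_y.inverse_transform(pred_s.reshape(-1, 1)).ravel()

# For ordinary least squares, this should match a model fit on the raw values.
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_new)
print(pred, raw_pred)
```

For plain linear regression, standardizing does not change the fitted line once predictions are mapped back. Scaling matters more for regularized or distance-based models.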

In [None]:
models = LinearRegression()
models.fit(Xs, Ys)

In [None]:
predictionss = models.predict(Xs)
predictionss

In [None]:
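# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(predictionss - Ys, kde=True)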

In [None]:
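plt.scatter(Xs, Ys, color = 'red')
# Use the model fit on the standardized data (models), not the unscaled one (model)
plt.plot(Xs, models.predict(Xs), color = 'blue')
plt.title('W/L% vs Pace (standardized)')
plt.xlabel('W/L% (standardized)')
plt.ylabel('Pace (standardized)')
plt.show()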

Convert categorical data into dummy (indicator) variables

In [None]:
# Note: VORP is numeric, so this creates one indicator column per distinct VORP value
VORP_dummies = pd.get_dummies(nba.VORP, prefix='VORP')
nba = pd.concat([nba, VORP_dummies], axis = 1)

In [None]:
Player_dummies = pd.get_dummies(nba.Player, prefix='Player') 
nba = pd.concat([nba, Player_dummies], axis = 1)
nba.head()
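Dummy encoding adds one column per distinct value, so it can help to check the number of distinct values before encoding a column. The sketch below uses a hypothetical toy frame (invented players and values) and encodes only the categorical columns, with `drop_first=True` to avoid a redundant indicator per column; treat it as one possible approach rather than the notebook's method.

```python
import pandas as pd

# Hypothetical toy frame; the player names and values are invented.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "A", "B"],
    "Team": ["Hawks", "Hawks", "Wizards", "Wizards", "Hawks"],
    "VORP": [9.8, 6.4, 0.7, 2.5, 0.3],
})

# Count distinct values first: each one becomes a column after encoding.
cardinality = toy[["Player", "Team"]].nunique()
print(cardinality)

# Encode the categorical columns only; numeric VORP stays as a single column.
encoded = pd.get_dummies(
    toy, columns=["Player", "Team"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```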

In [None]:
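# Keep numeric and boolean columns; dummy columns are uint8 or bool depending on the pandas version
dataset_new = nba.select_dtypes(include=['number', 'bool'])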
dataset_new.head()

In [None]:
# index=False avoids writing the row index as an extra 'Unnamed: 0' column
dataset_new.to_csv("salary_dummies.csv", index=False)

In [None]:
dataset_new

# Summary

Now that the data has been preprocessed and a baseline linear regression has been fit, we are ready to start the modeling process. Data preprocessing transforms raw data into a format a model can use, which is also why it is an important step in data mining: raw data is rarely usable as-is.

For modeling, we will look at categorical data that will help us find which team stats, and which player stats, are important for winning games.