# Seven Steps in Machine Learning

# Step 1: Frame the Problem

We need to know what the customer wants to get out of the data. So we identify the target, or define the classes or groups, and decide whether to use supervised or unsupervised machine learning.
For a supervised problem, we also decide whether it is a regression or a classification problem.

# Step 2: Obtain the Data

We need to load the data from the sources the customer provides. Some common source formats:
- Spreadsheets such as Excel sheets, or comma-separated values (CSV) files
- SQL Databases

!wget http://...data.csv   

Load the data into a DataFrame using pandas (imported with import pandas as pd):   
data = pd.read_csv("data.csv")

Review the data using functions such as:
- data.info() to see the number of columns and rows, the column types, and the non-null counts.
- data.describe() to review the summary statistics (mean, std, quartiles) of each numeric column
- data.head() and data.tail() to review what the data looks like
- data.shape to quickly find the number of rows and columns we are dealing with
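
The review calls above can be tried on a small in-memory table. This is a minimal sketch: the CSV text below is a made-up stand-in for data.csv (a few Titanic-style rows, one of them with a missing Age), not the real dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv: four Titanic-style rows, one Age missing.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.92
4,0,2,male,35,13.00
"""

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

data.info()             # column names, dtypes, and non-null counts
print(data.describe())  # count, mean, std, min, quartiles, max per numeric column
print(data.head(2))     # first two rows
print(data.tail(2))     # last two rows
print(data.shape)       # (number of rows, number of columns)
```

In the data.info() output, the non-null count of Age is lower than the number of rows, which is the first hint that the column needs attention in Step 4.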

# Step 3: Analyze the Data

For this step we can use functions such as:

- ms.matrix(data) (with missingno imported as ms) to visualize the missing values
- seaborn.jointplot to visualize the relationship between two columns (features); older seaborn versions also annotated the plot with the Pearson r coefficient (correlation) and its p value
- seaborn.distplot to visualize the density function graph of a specific feature (deprecated in recent seaborn versions in favor of seaborn.histplot and seaborn.displot)
- seaborn.heatmap to visualize the correlation between all the features as colors from cold to warm
- seaborn.swarmplot to visualize the swarm graph (points of concentration) between two features (Age: numerical data, Pclass: categorical ordinal data)
- seaborn.countplot to visualize categorical data in a bar chart
- data.hist() to visualize the histogram of every numeric column, or data['Age'].hist() for one specific column (feature)
- seaborn.boxplot to visualize the quartiles of one feature grouped by another

In short, Seaborn covers the correlation between variables, the histogram of each column, and the swarm plot.
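
It can help to look at the numbers behind these plots before drawing them. The sketch below uses only pandas on a small made-up table (the values are illustrative, not the real Titanic data) and computes the quantities that the missing-value matrix, the heatmap, the count plot, and the box plot display.

```python
import pandas as pd

# Made-up Titanic-style sample; the values are for illustration only.
data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 2, 1, 3],
    "Age":      [22.0, 38.0, None, 35.0, 54.0, 2.0],
    "Fare":     [7.25, 71.28, 7.92, 13.00, 51.86, 21.08],
})

# What a missing-value matrix shows: how many values are missing per column
missing_per_column = data.isnull().sum()

# What a correlation heatmap colors: the pairwise Pearson correlation matrix
corr = data.select_dtypes("number").corr()

# What a count plot draws as bars: the number of rows in each category
class_counts = data["Pclass"].value_counts()

# What a box plot draws: the quartiles of one feature grouped by another
age_summary_by_class = data.groupby("Pclass")["Age"].describe()

print(missing_per_column)
print(corr)
print(class_counts)
print(age_summary_by_class)
```

Passing corr to seaborn.heatmap, for example, would render the same matrix as colors.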

https://towardsdatascience.com/data-types-in-statistics-347e152e8bee

# Step 4: Feature Engineering

Deal with Missing Values
- Review the columns with missing values using data['Cabin'].value_counts()
- Impute the missing values where necessary using .apply(impute_age, axis=1), where impute_age is a function you write yourself
- Visualize the missing values using missingno.matrix()
- Drop the features you are not going to use with .drop('Cabin', axis=1, inplace=True)
- Check how many missing values are left with .isnull().sum()
- Drop the rows that you are not going to use with .dropna(inplace=True)
- Visualize the missing values again using missingno.matrix() to confirm that none remain
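
The steps above can be sketched as follows. The impute_age helper here is hypothetical: the original does not show its body, so this version simply assumes that a missing Age is replaced by the median age of the passenger's class. The sample table is made up as well.

```python
import pandas as pd

# Made-up sample with gaps in Age, Cabin, and Embarked.
data = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Age":      [38.0, None, 30.0, None, 20.0, None],
    "Cabin":    ["C85", None, None, None, None, None],
    "Embarked": ["C", "S", "S", None, "Q", "S"],
})

# Median age per class, computed from the rows where Age is known
median_age_by_class = data.groupby("Pclass")["Age"].median()


def impute_age(row):
    """Hypothetical helper: keep a known Age, else use the class median."""
    if pd.isnull(row["Age"]):
        return median_age_by_class[row["Pclass"]]
    return row["Age"]


data["Age"] = data.apply(impute_age, axis=1)  # fill the missing ages
data = data.drop("Cabin", axis=1)             # mostly empty, so drop the column
data = data.dropna()                          # drop the few rows that still have gaps

print(data.isnull().sum())                    # every count should now be zero
```

Reassigning the result (data = data.drop(...)) does the same job as inplace=True and makes each step easier to undo while experimenting.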


Convert Categorical Features
- Review the categorical data using .value_counts(). That will give you the number of elements of each category in the column (feature).
- Get the dummy values using sex = pd.get_dummies(data['Sex'], drop_first=True), and in the same way embarked = pd.get_dummies(data['Embarked'], drop_first=True)
- Optional: make a copy of your data with old_data = data.copy()   

Drop the columns not used   
- Drop the columns you are not going to use with data.drop(['Sex','Embarked','Name','Ticket'], axis=1, inplace=True)   

Concatenate the data
- Concatenate the columns that you are finally going to use with data = pd.concat([data,sex,embarked],axis=1)
- Optional: describe the data with data.describe(); also view the information with data.head() or data.info()
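
Put together, the conversion, drop, and concatenation steps might look like this on a small made-up table. The column names follow the Titanic example used in this document; the rows themselves are invented.

```python
import pandas as pd

# Made-up sample with the categorical and text columns used in this section.
data = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex":      ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Name":     ["A", "B", "C", "D"],
    "Ticket":   ["t1", "t2", "t3", "t4"],
})

print(data["Sex"].value_counts())  # number of rows per category

# drop_first=True drops the first category (alphabetically), which is
# implied by the remaining dummy columns being all zero.
sex = pd.get_dummies(data["Sex"], drop_first=True)            # keeps 'male'
embarked = pd.get_dummies(data["Embarked"], drop_first=True)  # keeps 'Q' and 'S'

old_data = data.copy()  # optional backup of the original columns

data = data.drop(["Sex", "Embarked", "Name", "Ticket"], axis=1)
data = pd.concat([data, sex, embarked], axis=1)

print(data.head())
```

Dropping the first dummy column avoids feeding the model two perfectly correlated columns (male and female) that carry the same information.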



# Step 5: Model Selection

Train, Test Split   
- Use Scikit Learn to split the dataset into Train, Test datasets   
   from sklearn.model_selection import train_test_split  
- Define your X dataset as your data except the target column, in this case "Survived", and your y or target as data['Survived'].
- Define the size of your test dataset with test_size (0.3 keeps 30% of the rows for testing).
- Set a random_state so that the shuffle is reproducible. X and y are always shuffled together, so rows and labels stay aligned.   
   X_train, X_test, y_train, y_test = train_test_split(data.drop('Survived',axis=1), data['Survived'], test_size=0.3, random_state=101)
- You can verify with len(y_test) / len(data) that the split matches the percentage you set
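
As a quick check of the split, here is a minimal sketch on 100 rows of randomly generated, Titanic-shaped data (the values are synthetic and only serve to show the sizes and the alignment of X and y).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows with two features and a binary target.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Pclass":   rng.integers(1, 4, size=100),
    "Age":      rng.uniform(1, 80, size=100),
    "Survived": rng.integers(0, 2, size=100),
})

X = data.drop("Survived", axis=1)  # every column except the target
y = data["Survived"]               # the target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

print(len(y_train), len(y_test))   # sizes of the two parts
print(len(y_test) / len(data))     # fraction of rows held out for testing

# Rows and labels stay aligned: both parts share the same index values
print((X_test.index == y_test.index).all())
```

Running the split again with the same random_state returns exactly the same rows, which makes later results comparable between runs.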

Now we are ready to choose the model
- Use SciKit Learn to choose the model of your preference   
  from sklearn.linear_model import LogisticRegression
- Create a variable where your model will reside (the parentheses create the model instance)   
    log_model = LogisticRegression()
- Fit the data to your model   
    log_model.fit(X_train,y_train)
- Get the coefficients   
    log_model.coef_
- Get the intercept    
    log_model.intercept_     
    
Now we are ready to predict using the Test dataset
- Apply the predict function   
    y_predict = log_model.predict(X_test)
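
The fit and predict calls above can be run end to end on synthetic data. In this sketch the two features and the labels are randomly generated (an assumption made only so the block is self-contained); with the Titanic data, X_train and y_train would come from the split shown earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, linearly separable data: the label is 1 when the two features sum above 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

log_model = LogisticRegression()  # instantiate the model (note the parentheses)
log_model.fit(X_train, y_train)   # learn the coefficients from the training set

print(log_model.coef_)            # one weight per feature, shape (1, n_features)
print(log_model.intercept_)       # the bias term, shape (1,)

y_predict = log_model.predict(X_test)  # predicted class (0 or 1) for each test row
print(y_predict[:10])
```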

# Step 6: Evaluation

- Use the confusion matrix and classification report functions that SciKit Learn provides   
    from sklearn.metrics import confusion_matrix, classification_report
- classification_report(y_test, y_predict) gives you a summary table of the precision, recall, and f1 scores per class, plus the overall accuracy
- You can also import each function separately   
    from sklearn.metrics import accuracy_score   
    from sklearn.metrics import recall_score   
    from sklearn.metrics import f1_score
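
A small worked example makes the metrics easier to read. The two label lists below are made up: ten cases with one false positive and one false negative.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 5 negatives and 5 positives, with two mistakes in the predictions.
y_test    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_predict = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Rows are the true classes, columns the predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_test, y_predict)
print(cm)

accuracy = accuracy_score(y_test, y_predict)    # share of correct predictions
precision = precision_score(y_test, y_predict)  # TP / (TP + FP)
recall = recall_score(y_test, y_predict)        # TP / (TP + FN)
f1 = f1_score(y_test, y_predict)                # harmonic mean of precision and recall
print(accuracy, precision, recall, f1)

# Per-class precision, recall, and f1-score, plus the overall accuracy
print(classification_report(y_test, y_predict))
```

Here 4 of the 5 negatives and 4 of the 5 positives are classified correctly, so all four scores come out at 0.8.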


# Step 7: Predict on New Cases

Now you are ready to predict on new cases. Kaggle sometimes provides a separate dataset for validation.   
Get the data:   
!wget http:// .../validation.csv   
Load the data into a DataFrame:   
prod_data = pd.read_csv('validation.csv')   
Review the data using:   
prod_data.info()   
ms.matrix(prod_data)

Clean the Data
- Drop the FEATURES (columns) that you are not going to use
  prod_data.drop('Cabin', axis = 1, inplace= True)
- Since you cannot drop any ROW (every passenger needs a prediction), fill the missing values instead, one column at a time
  prod_data['Fare'] = prod_data['Fare'].fillna(prod_data['Fare'].mean())
- Imputing the Age COLUMN, as in Step 4, is another possibility.
- Get the dummies
- Concatenate the data
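
The cleaning steps above might look like this on a small made-up production table. Filling Age with the median is an assumption made for this sketch; the class-based imputation from Step 4 would work as well.

```python
import pandas as pd

# Made-up production rows; no "Survived" column, and some values are missing.
prod_data = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Pclass":      [3, 3, 2, 1],
    "Sex":         ["male", "female", "male", "female"],
    "Age":         [34.5, None, 62.0, 27.0],
    "Fare":        [7.83, 7.00, None, 9.17],
    "Cabin":       [None, None, None, "B45"],
    "Embarked":    ["Q", "S", "Q", "C"],
})

prod_data = prod_data.drop("Cabin", axis=1)  # unused feature

# Fill each column with its own statistic. Calling fillna on the whole
# DataFrame with the Fare mean would also write that value into Age.
prod_data["Fare"] = prod_data["Fare"].fillna(prod_data["Fare"].mean())
prod_data["Age"] = prod_data["Age"].fillna(prod_data["Age"].median())  # assumption: median

# Same dummy encoding as for the training data
sex = pd.get_dummies(prod_data["Sex"], drop_first=True)
embarked = pd.get_dummies(prod_data["Embarked"], drop_first=True)

prod_data = pd.concat(
    [prod_data.drop(["Sex", "Embarked"], axis=1), sex, embarked], axis=1
)

print(prod_data.isnull().sum())  # no missing values, and no rows were dropped
```

The model expects the same columns, in the same order, as the training data, so it is worth comparing prod_data.columns against the training columns before predicting.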

Ready! Apply your model (log_model from Step 5) on the clean production data, which must have the same columns as the training data:
- predict1=log_model.predict(prod_data)

Then create a DataFrame with a "Survived" column that stores the predictions:   
df1=pd.DataFrame(predict1,columns=['Survived'])

You also need the Id of each passenger:
- df2=pd.DataFrame(prod_data['PassengerId'],columns=['PassengerId'])

Okay, now ready to make your final concatenation:
- result = pd.concat([df2,df1],axis=1)

Now put your data on a CSV File to submit it to Kaggle:
- result.to_csv('result.csv',index=False)
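
The whole prediction step can be sketched end to end as below. The tiny training table, the production rows, and the features list are all made up for illustration, and the model is trained without PassengerId as a feature, which is a simplification of this sketch rather than what the steps above do. The CSV is written to an in-memory buffer so the block leaves no file behind; passing 'result.csv' to to_csv writes the actual submission file.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical feature list and a tiny made-up training set.
features = ["Pclass", "Fare"]
train = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
    "Survived": [0, 1, 0, 1, 1, 0],
})
log_model = LogisticRegression(max_iter=1000)
log_model.fit(train[features], train["Survived"])

# Made-up, already cleaned production data
prod_data = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass":      [3, 1, 2],
    "Fare":        [7.83, 60.00, 9.69],
})

predict1 = log_model.predict(prod_data[features])

df1 = pd.DataFrame(predict1, columns=["Survived"])  # the predictions
df2 = prod_data[["PassengerId"]]                    # the passenger IDs
result = pd.concat([df2, df1], axis=1)              # aligned on the default row index

buffer = io.StringIO()
result.to_csv(buffer, index=False)                  # index=False keeps only the two columns
print(buffer.getvalue())
```

The concatenation lines up correctly because both DataFrames share the default 0..n-1 index; if rows had been filtered out of prod_data, calling reset_index(drop=True) on it first would keep the IDs and predictions paired.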

