# Project 1: Removing Inconsistencies in Concrete Compressive Strength

## Read Carefully

Hi, and welcome once again to your capstone project on regression. In this capstone, you will predict concrete compressive strength based on records from a machine, as described in this [document](https://docs.google.com/document/d/1oxjx9r5ZsjLU0m6cJIHbC4X5p7nXnKZEJWjeI2BuyOg/edit?usp=sharing).


You are required to explain what you do and why you make certain decisions every step of the workflow.

>  **The minimum performance you should aim for, using the R-squared ($r^2$) evaluation metric, is 30%.**

65% of your mark will be based on how well you structure your notebook and follow the instructions, while 35% will be based on whether you beat the performance threshold on the test data **without overfitting**.
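
For reference, the R-squared score named above (written $r^2$ in this brief, often $R^2$ elsewhere) is usually defined as below, where $y_i$ are the observed strengths, $\hat{y}_i$ are the model's predictions, and $\bar{y}$ is the mean of the observed values. This is the standard textbook definition, added here as a reminder rather than taken from the project document:

$$
r^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}
$$

Under this definition, the 30% threshold presumably means $r^2 \geq 0.30$ on the test data. A model that always predicts the mean of the data it is evaluated on scores exactly 0.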

DEADLINE TO SUBMIT THIS PROJECT IS 3RD OF SEPTEMBER 2020 BY 11:59 PM.

## Where to get the data for the project?

You can get the data here: http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength > Data Folder > Concrete_Data.xls

> Please download both `Concrete_Data.xls` and `Concrete_Readme.txt`, because you'll need the `.txt` file to understand the data source, description, characteristics, and so on. This is for proper attribution and compliance (if there is any need, because it is an open data set).

## How should you structure your project?

### Offline

If you are working on this project on your local machine (or away from Colaboratory), please create a new folder dedicated to this capstone project so that your folder structure and naming conventions stay clean.

Make sure the dataset is located in this new folder, along with your Jupyter Notebook and any other files associated with the project. This will make it much easier to commit everything to your GitHub repository when the time comes.


### Online (Google Colaboratory)

If you are working directly on Google Colab, please create a new folder in your Google Drive account and keep your notebook in this folder.

Use this direct link to load the data: http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls

Remember the file extension is `.xls`. Please find out how to use Pandas to load such data. If you cannot find a way, remember that you can download it to your PC, open it up in Excel (or other spreadsheet programs), and convert it to `.csv`.
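
As a starting point, here is a minimal sketch of one way to load the file with pandas, covering both routes described above (reading the `.xls` directly, or reading a `.csv` you converted yourself). The helper name `load_concrete_data` is hypothetical, and reading legacy `.xls` files with `pd.read_excel` typically needs the optional `xlrd` package to be installed.

```python
from pathlib import Path

import pandas as pd

# Direct link copied from this project brief.
DATA_URL = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/"
    "concrete/compressive/Concrete_Data.xls"
)


def load_concrete_data(source):
    """Load the data from a local path or URL, picking the reader by file extension.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    suffix = Path(str(source)).suffix.lower()
    if suffix in {".xls", ".xlsx"}:
        # Legacy .xls files usually require the optional `xlrd` dependency.
        return pd.read_excel(source)
    if suffix == ".csv":
        return pd.read_csv(source)
    raise ValueError(f"Unsupported file type: {suffix!r}")


# Example usage (needs network access and xlrd, so it is left commented out):
# df = load_concrete_data(DATA_URL)
# df.head()
```

If you went the spreadsheet route instead, the same call works on your converted file, for example `load_concrete_data("Concrete_Data.csv")` (the file name here is an assumption about what you saved it as).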

Please also link to the data description or just the data folder for proper attributions using this link: http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/

## Instructions 

The instructions here are by no means **strict** rules, but rather a guide for you to follow along. Please feel free to explore your own paths and methods, and get creative about how to solve the problem at hand.

You can review the typical ML workflow below:

![ML workflow](https://cdn.hashnode.com/res/hashnode/image/upload/v1588203774266/AneXwuOgV.png?auto=format&q=60)

### 1. Frame the problem

Help me understand why you think this problem can be solved with Machine Learning and why it is a regression problem.

Please answer the following questions along the way:

1. What is the overall objective of the project? What are we trying to achieve based on the interaction from [this document](https://docs.google.com/document/d/1oxjx9r5ZsjLU0m6cJIHbC4X5p7nXnKZEJWjeI2BuyOg/edit?usp=sharing)?

2. How do you think this project will impact the company or department involved?

3. Should this project be solved with ML? If yes, what makes it a Machine Learning problem? (If it is actually a machine learning problem.)

4. If it is an ML problem, what makes it a regression problem? (If it is actually a regression problem.)


> Once again, the performance threshold expected is not less than 30% as we described earlier.

### 2. Get the data relevant/related to the problem you are trying to solve.

Inspect the data and the data description to make sure you understand the data source and the other characteristics of the data. Read the description carefully. If any attribution is required, **please give such attribution just before you start importing your libraries.**

It would also be helpful to answer the following questions in your notebook.

1. Is your data source a primary data source or third-party data source?

2. Would you consider the number of observations in the data small? If yes, do you think it is suitable for a machine learning problem?

3. Do you think the data attributes are relevant to the problem you are trying to solve, or do you need more insights to confirm?

4. Do you think you have enough domain knowledge to figure out what attributes are good or bad?

### - Import your libraries

Help me understand what libraries and/or frameworks you think would be important for this project and why.

> List the libraries you think would be useful for solving this problem and explain, in short sentences, why you think they'd be useful.

### 3. Explore the data to gain insights on it.

Look for things like:



* Number of attributes and observations.

* Units for each attribute, and whether you think they correspond to actual units in real-world circumstances. (You may have to do some research for this one.)

* Missing values.

* Consider renaming the attributes/features using simpler naming conventions so it would be easier for you to work with.

* Make a pair plot to visually figure out which attributes are correlated with each other.

* Merely looking at the pair plot, are there outliers in the data? **Note that you can also use code snippets and methods from the SciPy libraries to find outliers if you are not sure they exist in the data by just looking at the plot.**

* Plot a correlation graph to truly understand how correlated the features are with each other.

* Note which ones are positively correlated, negatively correlated, and have zero correlation.

* You can decide to plot various attributes against each other for more visual understanding.

* Can you sense the presence of multicollinear features? If yes, what are they? If the correlation between two features is greater than 0.7, you may want to remove one of them!
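
Several of the checks in this list can be sketched with a few small pandas helpers. The snippet below is only an illustration on synthetic data: the function names and the demo columns `a`, `b`, `c` are hypothetical, and the outlier check uses the common 1.5 × IQR rule with plain pandas rather than a SciPy method, which is a swapped-in alternative to what the list suggests.

```python
import numpy as np
import pandas as pd


def missing_value_counts(df):
    """Number of missing values in each column."""
    return df.isna().sum()


def iqr_outlier_counts(df, k=1.5):
    """Count values per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


def highly_correlated_pairs(df, threshold=0.7):
    """List (column_a, column_b, r) for column pairs whose |Pearson r| exceeds threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Synthetic demo data (NOT the concrete dataset): "b" is nearly a copy of "a".
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "a": x,
        "b": 2 * x + rng.normal(scale=0.1, size=200),
        "c": rng.normal(size=200),
    }
)

print(missing_value_counts(demo))
print(iqr_outlier_counts(demo))
print(highly_correlated_pairs(demo))
```

When checking for multicollinearity on the real data, you would probably pass only the feature columns (not the target) to `highly_correlated_pairs`, since a feature that correlates strongly with the target is usually a good thing.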

### 4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.

* Do you think you need to take care of any missing values?

* Do you think you should remove any feature because of multicollinearity, irrelevance, or wrong metric?

* Do you think you can create entirely new features based on these old features that may help your algorithm learn better?

* Do you think you need to perform scaling of any kind? If yes, **please remember to split your data set before scaling.**
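
The reason for splitting before scaling is that the scaler should learn its statistics (mean and standard deviation) from the training data only; otherwise information from the test set leaks into training. A minimal sketch with scikit-learn, using synthetic data and assumed settings (`test_size=0.2`, `random_state=42`, and `StandardScaler` as one possible scaler):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (NOT the concrete dataset).
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# 1. Split first, so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. ...then reuse the training statistics to transform the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # close to 0 for every column
print(X_test_scaled.mean(axis=0))   # usually near 0, but not exactly 0
```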

### 5. Explore many different ML algorithms suitable for your problem and short-list the best models.

* Split the data set (if you haven't)!

* Try a basic algorithm or "weak model" to get a baseline performance.

* Try out the various regression algorithms and models you've learned about to see which ones improve the result, using either the default hyperparameters or your own educated guesses.

* Make sure to always evaluate for overfitting and underfitting.

* You can decide to write a function so your code is modular and easily reusable.
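
One possible shape for such a reusable function is sketched below. It fits a model and reports R-squared on both the training and test sets, so a large gap between the two hints at overfitting and two low scores hint at underfitting. The name `evaluate_model`, the synthetic data, and the choice of `DummyRegressor` as the "weak model" are all assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit `model` and return its R-squared on the training and test sets."""
    model.fit(X_train, y_train)
    return {
        "train_r2": r2_score(y_train, model.predict(X_train)),
        "test_r2": r2_score(y_test, model.predict(X_test)),
    }


# Synthetic regression data (NOT the concrete dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: always predicts the mean of the training target.
baseline = evaluate_model(
    DummyRegressor(strategy="mean"), X_train, X_test, y_train, y_test
)
# A first real candidate to compare against the baseline.
linear = evaluate_model(LinearRegression(), X_train, X_test, y_train, y_test)

print("baseline:", baseline)
print("linear:  ", linear)
```

Any scikit-learn regressor can be passed in the same way, which makes it easy to loop over a list of candidate models and compare them against the baseline.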

### 6. Fine-tune your models and combine them into a great solution.

* You can short-list your most promising models from the previous step.

* Try using hyperparameter optimization techniques to see if you can improve the performance of the model.

* If you use `GridSearchCV`, make sure to specify the `cv` parameter. Please review [this video](https://youtu.be/fSytzGwwBVw) to learn about the fundamentals of cross-validation.

* Make sure you are consistently evaluating for overfitting and underfitting, as well as checking whether you are meeting or beating the required performance threshold (≥30%) using the $r^2$ evaluation metric.
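
As a rough illustration of the `GridSearchCV` point above, the sketch below tunes a single hyperparameter with 5-fold cross-validation and then checks the tuned model on held-out test data. The model (`Ridge`), the `alpha` grid, `cv=5`, and the synthetic data are all assumptions chosen to keep the example short; your own shortlisted models and grids will differ.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (NOT the concrete dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Candidate values to try for Ridge's regularization strength.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# cv=5 requests 5-fold cross-validation on the training data only.
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# Cross-validated score of the best setting vs. score on unseen test data.
cv_r2 = search.best_score_
test_r2 = search.score(X_test, y_test)

print("best params:", search.best_params_)
print("mean CV r2: ", cv_r2)
print("test r2:    ", test_r2)
```

Comparing the mean cross-validated score with the test score is one quick way to spot overfitting: if the test score is much lower than the cross-validated one, the tuned model may not generalize well.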

### 7. Present your solution

Communicate your findings:

* Were you able to beat the required performance threshold?

* What are the implications if your model is deployed to real-world use?

* Can your model's results be interpreted? Can it be made transparent and accessible if needed by the stakeholders of the department or company?

* Please conclude the notebook and present it to me as if I were Smith, because my job is on the line and I'd love to make sure I know the strengths and limitations of your solution so I can prepare a proper proposal to the company.

### Past Materials You Can Reference


1. Getting Started With Machine Learning (No, "Practical" Machine Learning!) [here](https://blog.phcschoolofai.org/getting-started-with-machine-learning-no-practical-machine-learning-ck9bnkmdm03ascss1b28lqh70).


2. Getting Started With Machine Learning (No, "Practical" Machine Learning!) (Part 2) [here](https://blog.phcschoolofai.org/getting-started-with-machine-learning-no-practical-machine-learning-part-2-ck9ieyx7b083zcss1c7trnf4r).

3. Getting Started With Machine Learning (No, "Practical" Machine Learning!) (Part 3) [here](https://blog.phcschoolofai.org/getting-started-with-machine-learning-no-practical-machine-learning-part-3-ck9mf61vj002qihs1iwkzaedn).

### Good luck!