# Part 2: Exploring your data

Before we even think about modelling, we need to understand what we are dealing with. As we have limited time, we will conduct some exploration using visualisations.

We will use a library called matplotlib, which is another open-source project. It is a general-purpose plotting library, inspired by MATLAB.

For a regression task, we want to be able to predict our outcome (Compressive Strength) based on our inputs. Let's try and see if there are any obvious relationships between the features and the outcome.

In [None]:
# load in pandas, matplotlib and numpy
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# cell 'magic' which allows us to render our plots in the notebook
%matplotlib inline

## Some basic plotting

In [None]:
# define two numpy arrays for our x and y data
x = np.arange(10)
y = x * 3

print("x, y pairs are: {}".format(list(zip(x, y))))

In [None]:
plt.plot(x, y);

In [None]:
plt.scatter(x, y);

In [None]:
plt.scatter(x, y, color = "red", label = "i am a label")
plt.xlabel("hat")
plt.ylabel("cat")
plt.legend()
plt.title("I am a title");

Again, matplotlib offers a huge amount of functionality, and the [API documentation](https://matplotlib.org/) is the best place to get started. Also consider these [SciPy lecture notes](https://www.scipy-lectures.org/).

We will use matplotlib to explore our data. Matplotlib works natively with numpy arrays of data, but has built-in flexibility to work with pandas as well.
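
As a minimal sketch of that flexibility: a pandas Series can be passed to a plotting function directly, and many pyplot functions also accept a `data` argument so that columns can be referred to by name. The small DataFrame and the column names `x_col` and `y_col` below are made up for illustration and are not part of the concrete dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# toy data for illustration only (not the concrete dataset)
toy = pd.DataFrame({"x_col": np.arange(5), "y_col": np.arange(5) ** 2})

# option 1: pass pandas Series directly, just like numpy arrays
plt.plot(toy["x_col"], toy["y_col"], color="grey", label="Series passed directly")

# option 2: pass column names as strings and the DataFrame via `data`
plt.scatter("x_col", "y_col", data=toy, color="red", label="columns named via data=")

plt.xlabel("x_col")
plt.ylabel("y_col")
plt.legend();
```

Both calls draw the same points; which style to use is mostly a matter of taste.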

In [None]:
# import our data
concrete = pd.read_csv("processed/concrete_processed.csv")
concrete.head()

In [None]:
# let's visualise cement vs compressive strength
# alpha controls the transparency
plt.scatter(x = concrete.loc[:, "Cement"], y = concrete.loc[:, "CompressiveStrength"], alpha = 0.6)
plt.xlabel("Cement")
plt.ylabel("Compressive Strength");

What can we say about the relationship between the amount of cement in the mixture and compressive strength?

## Exercise: Explore the relationships between the features and the outcome

Discuss with your neighbour which features you think may be important for predicting compressive strength from the mixture.

In [None]:
# Your code here





## Questions

- What can we say about the relationship between the features and the outcome? Do we see any non-linear relationships?

- Do all features look useful?

- Do some mixtures lack ingredients?
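
For the last question, one way to check is to count how often each ingredient is exactly zero. This is a sketch only: it assumes an absent ingredient is recorded as 0 rather than as a missing value, and the DataFrame below is made up (`IngredientA` and `IngredientB` are placeholder names). On the real data the same expression could be applied to `concrete`.

```python
import pandas as pd

# made-up mixtures for illustration only; values are not from the dataset
toy = pd.DataFrame({
    "Cement": [540.0, 332.5, 198.6, 266.0],
    "IngredientA": [0.0, 142.5, 132.4, 0.0],
    "IngredientB": [0.0, 0.0, 0.0, 94.1],
})

# True wherever an ingredient is absent (assumed to be coded as 0),
# then count the True values per column
zero_counts = (toy == 0).sum()
print(zero_counts)

# the same idea expressed as a fraction of all mixtures
zero_fraction = (toy == 0).mean()
print(zero_fraction)
```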

You may wonder how you can model the compressive strength of the mixture based on the ingredients using such messy-looking data. In the next section, we will use a powerful machine learning algorithm that can combine the features to accurately predict the outcome.


In [None]:
# DO NOT RUN THIS CELL UNTIL EVERYONE HAS FINISHED

# from funcs.plot_fun import scatter_plot
# scatter_plot(concrete, 3, "CompressiveStrength")
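
The `scatter_plot` helper above comes from the workshop's own `funcs` package, and its implementation is not shown here. As a rough sketch of how a grid of feature-versus-outcome scatter plots could be built with matplotlib, consider the following. The function name `scatter_grid`, its arguments and the random toy data are all assumptions for illustration, not the workshop's code.

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def scatter_grid(df, n_cols, outcome):
    """Scatter every column of df (except outcome) against outcome."""
    features = [col for col in df.columns if col != outcome]
    n_rows = math.ceil(len(features) / n_cols)
    # squeeze=False keeps `axes` two-dimensional even for a single row
    fig, axes = plt.subplots(n_rows, n_cols,
                             figsize=(4 * n_cols, 3 * n_rows),
                             squeeze=False)
    flat_axes = axes.ravel()
    for ax, feature in zip(flat_axes, features):
        ax.scatter(df[feature], df[outcome], alpha=0.6)
        ax.set_xlabel(feature)
        ax.set_ylabel(outcome)
    # hide any unused panels in the last row
    for ax in flat_axes[len(features):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes


# random toy data for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((50, 5)),
                   columns=["f1", "f2", "f3", "f4", "outcome"])
fig, axes = scatter_grid(toy, 3, "outcome")
```

With four features and three columns, this produces a 2 x 3 grid in which the last two panels are hidden.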