### Using Pandas to Get Familiar With Our Data

Pandas is a powerful Python data analysis library built on top of NumPy, another library that lets you create and work with multi-dimensional arrays of data in Python.

The main pandas object is called a DataFrame. A DataFrame is essentially a 2D table of rows and columns, similar to a 2D NumPy array, that also has labels for its rows and columns. Unlike a NumPy array, each column can hold its own data type.
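
To make the idea of labelled rows and columns concrete, here is a minimal sketch. The toy values and the row labels (`house_a` and so on) are hypothetical and not taken from the Melbourne data.

```python
import pandas as pd

# Hypothetical toy data: three houses described by two numeric columns.
houses = pd.DataFrame(
    {"Rooms": [2, 3, 4], "Price": [650000, 903000, 1330000]},
    index=["house_a", "house_b", "house_c"],  # row labels
)

print(houses.columns.tolist())         # column labels
print(houses.index.tolist())           # row labels
print(houses.loc["house_b", "Price"])  # look up one cell by its labels
print(houses.to_numpy().shape)         # the underlying values form a 2D array
```

The labels are what separate a DataFrame from a bare array: cells can be addressed by name rather than by position.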

In [3]:
import pandas as pd

# Load data
melbourne_data_path = 'data/melb_data.csv' # path to the data
melbourne_data = pd.read_csv(melbourne_data_path) # load data into a DataFrame (pass the variable, not a quoted string)

# print(melbourne_data.columns)
melbourne_data.describe()


| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


#### Interpreting Data Description

- The results show 8 numbers for each numeric column in our original dataset (text columns are left out by default). The first number, the count, shows how many rows have non-missing values.

- Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.

- The second value is the mean, which is the average. Under that, std is the standard deviation, which measures how numerically spread out the values are.

- To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.
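
The statistics described above can be reproduced one at a time on a small column. This is a minimal sketch with hypothetical values. It also shows that pandas skips missing values and, by default, interpolates linearly between neighbouring sorted values when a percentile falls between two of them.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value (NaN).
values = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

summary = values.describe()
print(summary)

# The same statistics computed individually.
count = values.count()       # counts non-missing values only
mean = values.mean()         # NaN is skipped
std = values.std()           # sample standard deviation (ddof=1)
q25 = values.quantile(0.25)  # linear interpolation between sorted values
print(count, mean, std, q25)
```

Because of the interpolation, the 25% value need not be one of the values actually present in the column.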

### Let's Build Our First Model

### Selecting Data for Modeling

Our dataset has too many variables to wrap our head around, or even to print out nicely. To choose variables/columns, we'll need to see a list of all columns in the dataset. That is done with the `columns` property of the DataFrame (shown in the code below).

In [7]:
print(melbourne_data.columns)

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
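
The column list above is longer than the set of columns summarized by `describe()`, because `describe()` only reports numeric columns by default. The sketch below illustrates the difference on a hypothetical three-column frame. The values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame mixing one text column with two numeric columns.
sample = pd.DataFrame(
    {
        "Suburb": ["Abbotsford", "Richmond", "Carlton"],
        "Rooms": [2, 3, 4],
        "Price": [1035000.0, 1465000.0, 1600000.0],
    }
)

all_columns = sample.columns.tolist()                   # every column
described_columns = sample.describe().columns.tolist()  # numeric columns only

print(all_columns)
print(described_columns)
```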


> Note: The Melbourne data has some missing values (some houses for which some variables weren't recorded). Our Iowa data doesn't have missing values in the columns you use. So we will take the simplest option for now and drop the houses with missing values from our data. Don't worry about this much for now, though the code is:

In [8]:
# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
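
To see what `dropna(axis=0)` does, here is a minimal sketch on a hypothetical four-row frame. Any row with at least one missing value is removed, and the surviving rows keep their original index labels. This is why the row count falls from 13580 to 6196 between the two `describe()` outputs in this section, and why the rows shown by `X.head()` are labelled 1, 2, 4, 6, 7 rather than 0 to 4.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two different columns.
sample = pd.DataFrame(
    {
        "Rooms": [2, 3, 4, 3],
        "BuildingArea": [79.0, np.nan, 142.0, 210.0],
        "YearBuilt": [1900.0, 1970.0, np.nan, 2005.0],
    }
)

# Count the missing values in each column before dropping anything.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# axis=0 drops rows (not columns); a new DataFrame is returned.
complete_rows = sample.dropna(axis=0)
print(len(sample), "rows before,", len(complete_rows), "rows after")
print(complete_rows.index.tolist())  # original row labels are kept
```

Checking `isna().sum()` first is a cheap way to see how much data a blanket `dropna` would discard.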

#### We will focus on two approaches to selecting data for now.

> 1. Dot notation, which we use to select the `"prediction target"`
> 2. Selecting with a column list, which we use to select the `"features"`

### Selecting The `Prediction Target`

The prediction target is the house price which we want to predict. We will save this to a new variable called `y`.

You can pull out a variable with dot-notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.

We'll use dot notation to select the column we want to predict. By convention, the prediction target is called `y`, so the code we need to save the house prices in the Melbourne data is:

In [9]:
# Selecting The Prediction Target
y = melbourne_data.Price
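
As a small illustration on hypothetical data, dot notation and bracket notation return the same Series. Dot notation only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute, so bracket notation is the more general form.

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
sample = pd.DataFrame({"Rooms": [2, 3], "Price": [1035000.0, 1465000.0]})

y_dot = sample.Price         # dot notation
y_bracket = sample["Price"]  # bracket notation, same result

print(type(y_dot).__name__)     # the selected column is a Series
print(y_dot.equals(y_bracket))  # both selections hold identical data
```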

### Choosing `"Features"`

The columns that are inputted into our model (and later used to make predictions) are called "features." In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

In [11]:
# Choosing "Features" list of columns to use
melbourne_features = ["Rooms", "Bathroom", "Landsize", "Lattitude", "Longtitude"]

In [13]:
# By convention, this data is called X
X = melbourne_data[melbourne_features]
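
Selecting with a list of names returns a DataFrame, and every name must match a column exactly. The dataset spells its coordinate columns `Lattitude` and `Longtitude`, so the usual English spelling would raise a `KeyError`. The sketch below, on hypothetical data, checks the names before selecting.

```python
import pandas as pd

# Hypothetical two-row frame using the dataset's own column spellings.
sample = pd.DataFrame(
    {
        "Rooms": [2, 3],
        "Bathroom": [1.0, 2.0],
        "Lattitude": [-37.8079, -37.8093],
        "Price": [1035000.0, 1465000.0],
    }
)

# "Latitude" is the usual spelling, but it is not a column in this data.
wanted = ["Rooms", "Bathroom", "Latitude"]
missing = [name for name in wanted if name not in sample.columns]
print("Not in the data:", missing)

features = ["Rooms", "Bathroom", "Lattitude"]
X_sample = sample[features]  # a list of names returns a DataFrame
print(type(X_sample).__name__, X_sample.shape)
```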

Let's quickly review the data we'll be using to predict house prices using the describe method and the head method, which shows the top few rows.

In [15]:
# Review the data we'll be using to predict house prices
X.describe()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | -37.45709 | 145.52635 |


In [16]:
X.head()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 4 | 4 | 1.0 | 120.0 | -37.8072 | 144.9941 |
| 6 | 3 | 2.0 | 245.0 | -37.8024 | 144.9993 |
| 7 | 2 | 1.0 | 256.0 | -37.806 | 144.9954 |


Visually checking our data with these commands is an important part of a data scientist's job. We'll frequently find surprises in the dataset that deserve further inspection.

- The `describe()` method shows summary statistics for each numeric column.
- The `head()` method displays the first five rows of our data by default.


### Building Our Model

We will use the scikit-learn library to create our models. In code, this library is imported as `sklearn`, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

- `Define`: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
  
- `Fit`: Capture patterns from provided data. This is the heart of modeling.
  
- `Predict`: Just what it sounds like: use the fitted model to estimate the target for a given set of features.
  
- `Evaluate`: Determine how accurate the model's predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

In [17]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures we get the same results in each run. This is considered a good practice. We use any number, and model quality won't depend meaningfully on exactly what value we choose.
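
The reproducibility claim can be checked directly. The sketch below uses synthetic data generated only for this illustration (not the Melbourne data). It fits two trees with the same `random_state` and compares their predictions on unseen points.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data, generated only for this illustration.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 3))
y_demo = X_demo[:, 0] * 100 + rng.normal(0, 5, size=50)
X_new = rng.uniform(0, 10, size=(10, 3))  # points not used for fitting

# Two separate models, same data, same random_state.
model_a = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
model_b = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)

same = np.array_equal(model_a.predict(X_new), model_b.predict(X_new))
print("Identical predictions:", same)
```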

We now have a fitted model that we can use to make predictions.

In practice, we'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [18]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]
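
The four steps listed earlier (Define, Fit, Predict, Evaluate) can be sketched end to end on a tiny frame. The toy values are hypothetical, and using `mean_absolute_error` for the Evaluate step is an assumption of this sketch: the text above does not name a metric.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data; the values are for illustration only.
toy = pd.DataFrame(
    {
        "Rooms": [2, 3, 4, 3, 2],
        "Landsize": [156.0, 134.0, 120.0, 245.0, 256.0],
        "Price": [1035000.0, 1465000.0, 1600000.0, 1876000.0, 1636000.0],
    }
)
X_toy = toy[["Rooms", "Landsize"]]  # features
y_toy = toy.Price                   # prediction target

model = DecisionTreeRegressor(random_state=1)  # Define
model.fit(X_toy, y_toy)                        # Fit
predictions = model.predict(X_toy)             # Predict
mae = mean_absolute_error(y_toy, predictions)  # Evaluate (in-sample)
print("In-sample mean absolute error:", mae)
```

An unrestricted decision tree can memorize training rows that have distinct feature values, so an error measured on the training data says little about how well the model would price houses it has not seen.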
