# Introduction

AmeHouse real estate has given us two files with a **.csv** extension:

* train.csv --> The data we will use to build (train) our model
* test.csv --> The houses whose prices we will predict with that model

A .csv (comma-separated values) file is a simple, open format that stores tabular data as plain text, with one row per line and the columns separated by commas.

Python's **Pandas** library makes loading this data easy: import the package as pd, following the convention, and call read_csv() with the path to the file. The function also accepts a header argument, which controls which row supplies the column names. By default the first row of the file is used as the column names of the DataFrame, which is what we want here, so we do not pass it. With header=None the first row is treated as data instead.
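
To make the effect of the header argument concrete, here is a minimal sketch on a tiny in-memory CSV. The three columns and their values are invented for illustration and only stand in for the real train.csv.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for train.csv (made-up values).
csv_text = "Id,LotArea,SalePrice\n1,8450,208500\n2,9600,181500\n"

# Default behaviour: the first row supplies the column names.
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: the first row is kept as data and the columns are numbered 0, 1, 2.
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.shape)     # 2 data rows, 3 named columns
print(without_header.shape)  # 3 rows, because the header line is now data
```

Since our files start with a row of column names, the default is the right choice for them.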

#### Let's first import the libraries that we are going to use in this step

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

#### Load the two files

In [2]:
# Load the train set: the data that we are going to use to build / train the model
train_path = r'./input/train.csv'
train = pd.read_csv(train_path)

# Load the test set: the data that we are going to use to predict and assess the quality of the model

test_path = r'./input/test.csv'
test = pd.read_csv(test_path)

# test["SalePrice"] = np.nan # we don't have target values for the test

Now it is time to explore the data that we have just imported. There are a few questions that we need to answer. For clarity, they are split into separate sections. Ready?

**What is the size of our data? Do we have enough information?**

The size of our dataset is really important, because a prediction model needs as much information as possible to learn from. A dataset that is too small will lead to wrong conclusions. Let's see how we can check the size of our data.

In [3]:
print("Size of train data : {}".format(train.shape))

print("Size of test data : {}".format(test.shape))

Size of train data : (1460, 81)
Size of test data : (1459, 80)


The *train dataset* has 1460 rows (houses) and 81 columns (variables). We have almost the same number of rows to train the model (1460) and to predict on (1459), and that number is large enough for modelling. The number of variables, on the other hand, is quite large, and some of them are surely correlated with each other (positively or negatively). We will discuss how to select the most important variables in the next step.

The *test dataset* has 1459 rows and only 80 variables. It has one column fewer than the train set: the missing column is the one that we have to predict (the price of the houses, SalePrice).
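
Rather than comparing the two column lists by eye, the missing column can be found programmatically. The sketch below uses two small toy frames with invented values (toy_train and toy_test are hypothetical names, not the real data). The same call should work on the real train and test frames.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (made-up values).
toy_train = pd.DataFrame(
    {"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]}
)
toy_test = pd.DataFrame({"Id": [3], "LotArea": [11250]})

# Columns that are present in the train frame but absent from the test frame.
missing_in_test = toy_train.columns.difference(toy_test.columns)
print(list(missing_in_test))
```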

**What information does our dataset contain?**

We will perform this first analysis on the train dataset. As mentioned above, we will build our prediction model with this dataset, so it is the natural basis for our first exploration of the data.

In [4]:
# The info method summarises our dataset: the number of non-null values in each variable,
# the type of each column, etc.
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

Two observations can be drawn from this output:
- We can divide the variables that we have into two types: numerical and categorical. The numerical variables contain numbers, and pandas reports their type as *int64* or *float64*. The categorical variables contain text ("characters" or "letters"), and pandas reports their type as *object*.

- Some variables are not going to be relevant in our next steps. The number next to the type tells you how many houses (rows) have a value for that variable: it counts only the real items and leaves out the ones with missing information. Pandas marks this missing information as NaN, or Not a Number. For instance, Alley and PoolQC have only 91 and 7 filled rows respectively. As there are so many features to analyse, it is better to concentrate on the ones that can give us real insights. Let's remove the house identifier (Id) and the variables in which fewer than 30% of the rows are filled (that is, more than 70% NaN values).
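
The fill-rate rule can be sketched on a toy frame before it is applied to the real data. The values below are invented, and the names toy, missing_share, filled_share and kept are only illustrative. The 0.3 threshold is the one used in this step.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: one complete column and two sparse ones.
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250, 9550],
        "Alley": [np.nan, np.nan, np.nan, "Pave"],
        "PoolQC": [np.nan, np.nan, np.nan, np.nan],
    }
)

# Share of missing values per column (0.0 = complete, 1.0 = empty).
missing_share = toy.isnull().mean()

# Share of filled rows per column; keep columns that are at least 30% filled.
filled_share = toy.count() / len(toy)
kept = toy.loc[:, filled_share >= 0.3]

print(missing_share)
print(list(kept.columns))
```

Looking at missing_share first is a quick way to check that a threshold does not discard a column you would rather keep.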

In [6]:
# Define a new train dataset that keeps only the columns in which at least 30% of the rows hold real (non-NaN) values
train2 = train[[column for column in train if train[column].count() / len(train) >= 0.3]]

del train2['Id']
print("List of dropped columns:", end=" ")
for c in train.columns:
    if c not in train2.columns:
        print(c, end=", ")
print('\n')
train = train2

List of dropped columns: Id, Alley, PoolQC, Fence, MiscFeature, 



**How many variables do we have after this drop?**

In [7]:
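print("Size of train data : {}".format(train.shape))

# get_dtype_counts() was removed in pandas 1.0; dtypes.value_counts() gives
# the same counts per type (the row order of the result may differ)
train.dtypes.value_counts()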

Size of train data : (1460, 76)


int64      34
object     39
float64     3
dtype: int64

We have now reduced the number of variables from 81 to 76: 37 numerical variables (34 int64 and 3 float64) and 39 categorical variables (object).
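
If the two groups of variables are needed as lists later on, select_dtypes can separate them. This is only a sketch on a toy frame with invented values. The names toy, numeric_cols and categorical_cols are illustrative.

```python
import pandas as pd

# Toy frame with made-up values: two numerical columns and one text column.
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600],      # integers
        "LotFrontage": [65.0, 80.0],  # floats
        "MSZoning": ["RL", "RM"],     # text
    }
)

# Numerical columns: every integer or float dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Categorical columns: everything that is not a number.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```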

In [10]:
# Save the filtered dataset:
import os 
path = r'./input/'
os.chdir(path)  # note: this changes the working directory for all later cells
train.to_csv('Train_Filtered.csv')
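
One detail to keep in mind when reloading the saved file: by default to_csv also writes the row index as a first, unnamed column. The sketch below shows the round trip on a toy frame with invented values, using an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})

# Write to an in-memory buffer; the row index is written as the first column.
buffer = io.StringIO()
toy.to_csv(buffer)

# A plain read_csv brings the index back as an extra "Unnamed: 0" column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 tells pandas to use that first column as the index again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing index=False to to_csv would avoid the extra column altogether. Whether that is preferable depends on how the file is read in the next step.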

Now that we know our dataset, it is time to go further and start the *Exploratory Data Analysis (EDA)*.