## Regression Machine Learning Project

* Using the California Housing dataset for a regression problem. The target variable is the median house value for California districts, expressed in units of $100,000.

### Getting the California housing Dataset

* The dataset is fetched with scikit-learn's `fetch_california_housing` function
* It is then loaded into a pandas DataFrame

In [21]:
#Fetching the dataset from sklearn

#Library
from sklearn.datasets import fetch_california_housing

#Fetching the data (a dictionary-like Bunch object) into a variable with fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

#### The data is returned as a dictionary-like object. The next step is to convert it to a DataFrame

### Loading Data into Dataframe

* Converting the dictionary-like data loaded above into a DataFrame
    * **housing["data"]** contains every feature record, without the target
    * **columns = housing["feature_names"]** supplies the feature (column) names, without the target
    * **housing["target"]** contains the target variable

In [22]:
#Importing Pandas
import pandas as pd

#creating the data frame
housing_df = pd.DataFrame(housing["data"], columns = housing["feature_names"])

In [23]:
#Adding the target variable
housing_df["target"] = housing["target"] 

### Viewing the newly created DataFrame

In [24]:
housing_df.head()

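|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |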


### Features and Target Information

* Here is the data dictionary
    * MedInc: median income in block group
    * HouseAge: median house age in block group
    * AveRooms: average number of rooms per household
    * AveBedrms: average number of bedrooms per household
    * Population: block group population
    * AveOccup: average number of household members
    * Latitude: block group latitude
    * Longitude: block group longitude
    * target (MedHouseVal): median house value, in units of $100,000
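Because the target is stored in units of $100,000, it can help to convert a few values to dollars to check the scale. The snippet below is a small illustration only; the helper name `to_dollars` is hypothetical, and the three values are taken from the first rows of the DataFrame shown earlier.

```python
# The target is stored in units of $100,000, so multiply to get dollars.
UNIT_DOLLARS = 100_000  # scale stated in the dataset description

def to_dollars(med_house_val):
    """Convert a target value (in $100,000 units) to whole dollars."""
    return round(med_house_val * UNIT_DOLLARS)

# First three target values from the head of the DataFrame
for value in (4.526, 3.585, 3.521):
    print(f"{value} -> ${to_dollars(value):,}")
```

So a target of 4.526 corresponds to a median house value of $452,600.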

### Split the Dataset

* Put the feature data in X
* Put the target data in y
* Split both into training (80%) and test (20%) sets

In [28]:
#Feature data
X = housing_df.drop("target", axis=1)
#target data
y = housing_df["target"]

In [31]:
#import library for splitting
from sklearn.model_selection import train_test_split

#split the data into train and test
X_train, x_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size = 0.2)
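The split above is random, so the scores that follow will differ slightly from run to run. A minimal sketch of how `test_size` and a fixed `random_state` behave, using synthetic data as a stand-in for the housing features (the names `X_demo`, `y_demo` and the seed value 42 are assumptions for illustration, not part of the original notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and target (100 rows, 8 columns)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = rng.normal(size=100)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed gives an identical split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

With 100 rows, 80 go to training and 20 to testing. Passing the same `random_state` to the housing split would make the reported scores repeatable, although the exact numbers would then differ from the ones shown here.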

### Choosing the Algorithm

* Using the Ridge regression model

In [34]:
from sklearn.linear_model import Ridge

#Initializing the model
regmodel = Ridge()
#Fitting the model
regmodel.fit(X_train,y_train)

#Scoring the model (score returns R^2 on the test set)
regmodel.score(x_test, y_test)

0.614709997309058
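For scikit-learn regressors, `score` returns the coefficient of determination R², so the value above means the Ridge model explains roughly 61% of the variance in the test targets. As a rough sketch of what that number measures, R² can be computed by hand as 1 - SS_res / SS_tot (the function name `r_squared` and the toy values are illustrative only):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Sum of squared residuals between truth and prediction
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the true values
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)       # predictions match exactly -> 1.0
baseline = r_squared(y_true, [2.5] * 4)   # always predict the mean -> 0.0
print(perfect, baseline)
```

A score of 1.0 is a perfect fit, 0.0 is no better than always predicting the mean, and negative values are possible for models that do worse than that baseline.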

### Choosing the Algorithm

* Using Random Forest

In [36]:
#importing the library
from sklearn.ensemble import RandomForestRegressor

#initializing the model
ranfog = RandomForestRegressor()

#fitting the model
ranfog.fit(X_train,y_train)

#scoring the model (score returns R^2 on the test set)
ranfog.score(x_test,y_test)


0.8137454217425448

### Conclusion

#### From the above, the random forest model scored higher on the test set (R² ≈ 0.81) than the Ridge model (R² ≈ 0.61)
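This comparison rests on a single random train/test split. One way to check that a ranking like this is stable is k-fold cross-validation, sketched below on synthetic data. Everything here is an assumption made for illustration: the data come from `make_regression` rather than the housing set, and the fold count, tree count and seeds are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the housing features (assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=0
)

models = {
    "Ridge": Ridge(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validation; with no scoring argument, regressors are scored by R^2
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5)
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

The ranking on this synthetic data says nothing about the housing data. To apply the same check to the notebook, pass `X` and `y` in place of `X_demo` and `y_demo`.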