This is a data analysis project completed as part of a data science course. It analyzes a dataset of house sales in King County, USA (which includes Seattle), covering homes sold between May 2014 and May 2015.
Download the entire dataset here: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/kc_house_data_NaN.csv
The project is written in Python in a Jupyter Notebook.
The libraries used are:
- pandas
- matplotlib
- seaborn
- scikit-learn
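
As a rough illustration of how the dataset can be loaded and inspected with pandas, here is a minimal sketch. The helper name `load_house_data` and the three-row sample CSV are assumptions made for this example (the sample only mimics a few plausible columns so the snippet runs offline); they are not taken from the project notebook.

```python
import io

import pandas as pd

# URL of the full dataset, as given above.
DATA_URL = (
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
    "IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/"
    "FinalModule_Coursera/data/kc_house_data_NaN.csv"
)


def load_house_data(source=DATA_URL):
    """Read the CSV from a URL, a local path, or a file-like object.

    Hypothetical helper: it is just a thin wrapper around pd.read_csv.
    """
    return pd.read_csv(source)


# Tiny made-up sample (assumed column names) so the sketch runs without
# network access. Note the empty 'bedrooms' cell in the second row.
sample_csv = io.StringIO(
    "price,bedrooms,sqft_living\n"
    "221900,3,1180\n"
    "538000,,2570\n"
    "180000,2,770\n"
)

df_sample = load_house_data(sample_csv)
summary = df_sample.describe()       # statistical summary of numeric columns
missing = df_sample.isnull().sum()   # count of missing values per column
print(summary)
print(missing)
```

Calling `load_house_data()` with no argument should fetch the full file from the URL above, provided network access is available.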
The aims of the project were:
- Obtaining a statistical summary of the dataframe
- Replacing the missing values in the dataset with the mean values
- Comparing the price outliers of houses with and without a waterfront view
- Finding the correlation of certain features with the price of the house
- Creating linear regression models to predict price using different features and finding the coefficient of determination
- Splitting the dataset into test samples and training samples
- Creating a ridge regression object, fitting it on the training data, and finding the coefficient of determination on the test data, thus evaluating the model and refining it if required
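
The modeling steps listed above could look roughly like the following sketch. It runs on a small synthetic DataFrame rather than the real file, and the column names (`price`, `sqft_living`, `bedrooms`), the regularization strength `alpha=0.1`, and the 80/20 split are all assumptions chosen for illustration, not values confirmed by the project. The plotting step (the waterfront comparison) is left out to keep the example dependency-light.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house data (assumed column names).
rng = np.random.default_rng(0)
n = 200
sqft_living = rng.uniform(500, 4000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
price = 150 * sqft_living + 10_000 * bedrooms + rng.normal(0, 20_000, n)
df = pd.DataFrame(
    {"price": price, "sqft_living": sqft_living, "bedrooms": bedrooms}
)
df.loc[[3, 17], "bedrooms"] = np.nan  # simulate missing values

# Replace missing values with the column mean.
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].mean())

# Correlation of each feature with price, strongest first.
corr_with_price = df.corr()["price"].sort_values(ascending=False)

# Linear regression on the chosen features; score() returns R^2.
features = ["sqft_living", "bedrooms"]
X, y = df[features], df["price"]
linear_model = LinearRegression().fit(X, y)
r2_linear = linear_model.score(X, y)

# Split into training and test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Ridge regression: fit on the training data, evaluate R^2 on the test data.
ridge_model = Ridge(alpha=0.1).fit(X_train, y_train)
r2_ridge = ridge_model.score(X_test, y_test)

print(corr_with_price)
print("Linear R^2 (all data):", r2_linear)
print("Ridge R^2 (test data):", r2_ridge)
```

Scoring the ridge model on held-out test data, rather than on the data it was fitted on, gives a fairer estimate of how well it generalizes; if the score is poor, `alpha` or the feature set can be adjusted and the model refitted.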