# __INCREMENTAL CAPSTONE__
<br>
<br>  

## __WEEK 1__

__Task: Import and export data, clean data.__

1. Import relevant python libraries necessary for Python programming and Numpy for doing Numerical operations.
2. Import the CSV file – NSMES1988.csv into a dataframe.
3. Inspect the data and report the details from physical inspection – rows, columns, data types etc.
4. Find out if the data is clean or if the data has missing values.
5. Comment on the data types, their values and their range, specifically on age and income columns.
6. Export the data to JSON as NSMES1988.json format file and view and enter your comments.
7. Perform memory information on the data and recommend what non-default data types you would recommend to optimize memory settings for the dataframe.
8. What changes you would recommend on the dataframe before attempting a detailed data analysis.
9. Export the data frame as a new CSV file NSMES1988new.csv and store it in the local space for possible use in other assignments.
10. Write a short report on the visual observations of the data.

__1. Import relevant python libraries necessary for Python programming and Numpy for doing Numerical operations.__

In [3]:
import numpy as np
import pandas as pd

<br>

__2. Import the CSV file – NSMES1988.csv into a dataframe.__

In [4]:
df = pd.read_csv('../data/NSMES1988.csv')

<br>

__3. Inspect the data and report the details from physical inspection – rows, columns, data types etc.__

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.dtypes

## Drop column "Unnamed: 0"

In [None]:
df.head()

In [14]:
df.drop(columns=["Unnamed: 0"],inplace=True)
# Or, alternatively, reassign: df = df.drop(columns="Unnamed: 0")

In [None]:
df.head()

<br>

__4. Find out if the data is clean or if the data has missing values.__

In [None]:
# Obtain the number of missing values for each column
df.isnull().sum()

In [None]:
# Equivalently (axis=0 is the default):
# axis=0 sums down the rows, giving one total per column
df.isnull().sum(axis=0)

In [None]:
# Equivalently,
df.isna().sum()

In [None]:
# Remark
null_values_series = df.isna().sum()
null_values_series

In [None]:
type(null_values_series)

__Conclusion: there are no missing values.__

In [None]:
# Remark: notna is the opposite of isna
df.notna()

In [None]:
df.notna().sum()
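
The relationship between `isna` and `notna` can be checked on a small example. The sketch below uses a hypothetical three-row frame (not the NSMES1988 data) with two deliberate gaps, and also shows one way to reduce the check to a single yes/no answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the NSMES1988 file) with two deliberate gaps
toy = pd.DataFrame({"age": [6.9, np.nan, 7.4], "income": [2.9, 1.2, np.nan]})

missing_per_column = toy.isna().sum()    # number of NaN values per column
present_per_column = toy.notna().sum()   # number of non-NaN values per column

# The two counts always add up to the number of rows
assert ((missing_per_column + present_per_column) == len(toy)).all()

# A single boolean that answers "does the data have missing values?"
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print("Any missing values:", has_missing)
```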

<br>

__5. Comment on the data types, their values and their range, specifically on age and income columns__

In [None]:
df[['age','income']].dtypes

__Remark: the types of the age and income are ok.__

In [None]:
df.head(2)

In [None]:
print(df['age'].min(),df['age'].max())

In [None]:
print(df['income'].min(),df['income'].max())

In [None]:
df[['age','income']].describe()

<br>

__6. Export the data to JSON as NSMES1988.json format file and view and enter your comments.__

In [32]:
df.to_json('NSMES1988.json')
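
To view the exported file, one option is to load it back with the standard `json` module. The sketch below is an illustration on a hypothetical two-row frame written to a temporary directory; it assumes the default `orient` of `to_json`, which for a DataFrame nests the values as column, then index label, then value.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the real dataframe
toy = pd.DataFrame({"age": [6.9, 7.4], "visits": [5, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "toy.json")
    toy.to_json(json_path)              # default layout: {column: {index: value}}
    with open(json_path) as handle:
        parsed = json.load(handle)

# Index labels become string keys in the JSON file
print(json.dumps(parsed, indent=2))

# orient="records" gives one object per row, which is often easier to read
records = json.loads(toy.to_json(orient="records"))
print(records)
```

Note that the default layout repeats the index label for every column, so the JSON file is usually larger than the CSV it came from.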

<br>

__7. Perform memory information on the data and recommend what non-default data types you would recommend to optimize memory settings for the dataframe.__

In [None]:
df.info()

In [None]:
# More accurate memory estimate
df.info(memory_usage='deep')
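
`df.info()` reports only the total memory. A per-column breakdown, together with the value range of each integer column, makes it easier to recommend narrower dtypes. The sketch below uses hypothetical toy data, and `smallest_int_dtype` is an illustrative helper, not a pandas function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataframe
toy = pd.DataFrame({
    "visits": [5, 1, 13],
    "school": [6, 10, 12],
    "gender": ["male", "female", "female"],
})

# Bytes used by each column; deep=True also counts the Python strings
print(toy.memory_usage(deep=True))


def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype that holds the series' range."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        limits = np.iinfo(candidate)
        if limits.min <= series.min() and series.max() <= limits.max:
            return np.dtype(candidate)
    raise ValueError("values do not fit in a 64-bit signed integer")


for name in toy.select_dtypes(include=np.integer).columns:
    print(name, smallest_int_dtype(toy[name]))
```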

<br>

__8. What changes you would recommend on the dataframe before attempting a detailed data analysis.__

In [None]:
# Let's analyze the integers; create a dataframe that contains only the integer features
df2 = df[['visits','nvisits','ovisits','novisits','emergency','hospital','chronic','school']]
df2.head()

In [None]:
df2.info()

In [None]:
df2.max()

In [None]:
df2 = df2.astype('int16')
df2.info()

In [None]:
df2.max()

__The maximum values for the integers have not changed, so we did not lose any information when converting from 64 bits to 16 bits.__ 
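
Comparing maxima is a quick check; an element-wise comparison is stricter, since it also covers the minima and every value in between. The sketch below, on hypothetical toy data, shows that comparison, the automatic `downcast` option of `pd.to_numeric`, and what happens when a value does not fit in 16 bits.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; pandas stores these as int64 by default
toy = pd.DataFrame({"visits": [5, 1, 89], "hospital": [1, 0, 8]})

downcast = toy.astype("int16")

# Every value must survive the conversion, not just the maximum
lossless = bool((downcast == toy).all().all())
print("Lossless:", lossless)

# pd.to_numeric can pick the narrowest integer dtype automatically
auto = pd.to_numeric(toy["visits"], downcast="integer")
print(auto.dtype)

# Counter-example: 40_000 is outside the int16 range (-32768..32767),
# so a NumPy cast wraps around silently instead of raising an error
too_big = np.array([40_000], dtype="int64").astype("int16")
print(too_big)
```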

In [None]:
# Let's analyze some "float" features
df3 = df[['age','income']]
df3.head()

In [None]:
df3.info(memory_usage='deep')

In [None]:
df3.astype('float32').head()

In [None]:
df3.astype('float32').info()

__The displayed values for the floats have not changed when converting from 64 bits to 32 bits, so we did not lose any meaningful information in the conversion.__ 
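
Unlike the integer case, a float conversion is usually not bit-for-bit exact: float32 keeps roughly seven significant digits. The sketch below, on hypothetical values, measures the rounding error by converting back to float64 and comparing with the original.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the age and income columns
toy = pd.DataFrame({
    "age": [6.9, 7.4, 6.6],
    "income": [2.8810, 2.7478, 0.6532],
})

as_float32 = toy.astype("float32")

# Round-trip back to float64 and compare with the original values
round_trip = as_float32.astype("float64")
exactly_equal = bool((round_trip == toy).all().all())
close_enough = bool(np.allclose(round_trip, toy, rtol=1e-6))
max_abs_error = float((round_trip - toy).abs().to_numpy().max())

print("Exactly equal:", exactly_equal)
print("Equal within 1e-6 relative tolerance:", close_enough)
print("Largest absolute error:", max_abs_error)
```

For values of this magnitude the error is far below the precision at which the data is recorded, which is why the displayed values look unchanged.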

## Conclusion:
## To save memory, we recommend changing all the int64 to int16 and change from float64 to float32 for age and income.
<br>


<br>

__9. Export the data frame as a new CSV file NSMES1988new.csv and store it in the local space for possible use in other assignments.__

In [62]:
df.to_csv('NSMES1988new.csv',index=False)

In [None]:
df1= pd.read_csv('NSMES1988new.csv')
df1.head()

<br>

__10. Write a short report on the visual observations of the data.__

In this exercise, we performed a top-level data analysis after reading the provided CSV file. Inspecting the data showed that many int64 columns could be stored as int16, since their ranges fit, and that the two float columns, age and income, could be stored as float32.
We also looked closely at the age and income columns and examined their data types and ranges.
Finally, we exported the data to JSON format, wrote it locally as a JSON file, and inspected it visually.

<br>

## __WEEK 2__

__Task: Perform linear algebraic operations.__

 
1. Import relevant python libraries.
2. Import the CSV file – NSMES1988new.csv into a dataframe, and enforce the dtypes recommended in step 9 of Week 1.
3. Perform memory analysis of the new dataframe and compare it with the memory of the dataframe in the previous week and mark your comments.
4. Perform the following operations on age and income columns. Multiply Age by 10 and income by 10000. 
5. Perform basic statistical analysis on the new dataframe and generate a brief report on the outcome. Save the dataframe as NSMES1988updated.csv file in the local space for possible future use.
6. Invoke describe command on the dataframe and compare that with the basic statistics analysis done in the previous step, and report.
7. Indicate which of the columns are not eligible for statistical analysis and indicate possible datatype changes, and report. 
8. Make the changes recommended in the previous step, and export the result as a new .csv file for possible future use (Optional).
9. Prepare a brief report and enter it in the mark-up cells of JupyterLab Notebook.

<br>

__1. Import relevant python libraries.__

In [38]:
import numpy as np
import pandas as pd

<br>

__2. Import the CSV file – NSMES1988new.csv into a dataframe, and enforce the dtypes recommended in step 9 of Week 1.__

In [39]:
# Read the file exported in Week 1, step 9 (already without "Unnamed: 0")
df = pd.read_csv('NSMES1988new.csv')

In [None]:
df.info(memory_usage='deep')

In [41]:
df_save = df.copy()  # independent copy of the dataframe as loaded

In [None]:
# Keep only the numeric columns; copy so the assignments below act on a new frame
df = df.select_dtypes(include=['float64', 'int64']).copy()
float_columns = df.select_dtypes(include=['float64']).columns
int_columns = df.select_dtypes(include=['int64']).columns
# Downcast to the narrower dtypes recommended in Week 1
df[int_columns] = df[int_columns].astype('int16')
df[float_columns] = df[float_columns].astype('float32')
df.info(memory_usage='deep')

<br>

__3. Perform memory analysis of the new dataframe and compare it with the memory of the dataframe in the previous week and mark your comments.__

In [None]:
df.info(memory_usage='deep')
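
An alternative to converting after loading is to enforce the dtypes while reading, through the `dtype` parameter of `pd.read_csv`. The sketch below uses a hypothetical three-row CSV held in memory rather than the real file, and compares the memory of the default and the compact versions.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for NSMES1988new.csv
csv_text = "visits,age,income\n5,6.9,2.881\n1,7.4,2.7478\n13,6.6,0.6532\n"

# Default dtypes: int64 and float64
default_df = pd.read_csv(io.StringIO(csv_text))

# Dtypes enforced at read time
compact_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"visits": "int16", "age": "float32", "income": "float32"},
)

# Total bytes used by the columns (index excluded)
default_bytes = int(default_df.memory_usage(index=False, deep=True).sum())
compact_bytes = int(compact_df.memory_usage(index=False, deep=True).sum())

print("Default dtypes:", default_bytes, "bytes")
print("Compact dtypes:", compact_bytes, "bytes")
print("Saving: {:.0%}".format(1 - compact_bytes / default_bytes))
```

Because a CSV file does not store dtypes, this mapping has to be supplied every time the file is read.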

<br>

__4. Perform the following operations on age and income columns. Multiply Age by 10 and income by 10000.__

In [None]:
df["age"] = df["age"] * 10
df["income"] = df["income"] * 10_000
df.head()

<br>

__5. Perform basic statistical analysis on the new dataframe and generate a brief report on the outcome. Save the dataframe as NSMES1988updated.csv file in the local space for possible future use.__

In [None]:
df.describe()

In [None]:
df.max()

In [47]:
df.to_csv('../data/NSMES1988_2.csv', index=False)

In [None]:
test_df = pd.read_csv('../data/NSMES1988_2.csv')
test_df.head()

In [None]:
test_df.describe()

In [None]:
df1 = pd.read_csv('../data/NSMES1988_2.csv')  # read back the file written above
df1.head()

<br>

__6. Invoke describe command on the dataframe and compare that with the basic statistics analysis done in the previous step, and report.__

In [None]:
df.describe()
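
One way to make the comparison explicit is to compute each statistic separately and check it against the matching row of `describe()`. The sketch below does this on hypothetical toy values; it also shows a common source of mismatch, which is that pandas computes the sample standard deviation (`ddof=1`) while NumPy defaults to the population formula (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the scaled age and income columns
toy = pd.DataFrame({
    "age": [69.0, 74.0, 66.0, 76.0],
    "income": [28810.0, 27478.0, 6532.0, 6588.0],
})

# Basic statistics computed one by one
manual = pd.DataFrame({
    "mean": toy.mean(),
    "std": toy.std(),        # pandas uses the sample formula (ddof=1)
    "min": toy.min(),
    "50%": toy.median(),     # the median is the 50th percentile
    "max": toy.max(),
}).T

# The same rows taken from describe()
described = toy.describe().loc[manual.index]
matches = bool(np.allclose(manual.to_numpy(), described.to_numpy()))
print("Manual statistics match describe():", matches)

# NumPy's default is the population formula (ddof=0), which gives a smaller value
numpy_std = float(np.std(toy["age"].to_numpy()))
pandas_std = float(toy["age"].std())
print(numpy_std, pandas_std)
```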

<br>

__7. Indicate which of the columns are not eligible for statistical analysis and indicate possible datatype changes, and report.__

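`Unnamed: 0` is not eligible for statistical analysis because it only holds row numbers; it was removed during Week 1. Any non-numeric columns are also not eligible, which is why only the float64 and int64 columns were selected in step 2.

Possible datatype changes: convert int64 to int16 and float64 to float32.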

<br>

__8. Make the changes recommended in the previous step, and export the result as a new .csv file for possible future use (Optional).__
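
This optional step is left empty above. As a minimal sketch of what it could look like, the example below applies the numeric downcasts and, as an additional assumption not taken from the analysis above, converts text columns to the `category` dtype. The toy frame and its `gender` column are hypothetical, and the CSV is written to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Hypothetical toy data; "gender" stands in for any text column
toy = pd.DataFrame({
    "visits": [5, 1, 13, 16],
    "age": [6.9, 7.4, 6.6, 7.6],
    "gender": ["male", "female", "female", "male"],
})

converted = toy.copy()
converted["visits"] = converted["visits"].astype("int16")
converted["age"] = converted["age"].astype("float32")

# Assumption: text columns with few distinct values are stored as categories
for name in converted.select_dtypes(exclude="number").columns:
    converted[name] = converted[name].astype("category")

print(converted.dtypes)

# Export without the index so no "Unnamed: 0" column appears on re-import
buffer = io.StringIO()
converted.to_csv(buffer, index=False)
print(buffer.getvalue())
```

The dtypes are not stored in the CSV file, so they need to be reapplied when the file is read again.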

<br>

__9. Prepare a brief report and enter it in the mark-up cells of JupyterLab Notebook.__

From the above analysis, we determined the following. There are 20 columns and 40406 rows of data in total. Of the 20 columns, the one named 'Unnamed: 0' must be ignored, as it only represents row numbers. As part of the given task, we optimized the memory usage of the dataset and found that a substantial saving is possible when the dtypes are chosen according to the range of each column.
Further, we performed two math operations, one on the age column and the other on the income column. We multiplied age by 10 and income by 10,000, because the given dataset stores these values scaled down by those factors.
Next, we computed basic statistics on the relevant numerical columns and compared them with the values returned by the describe() function. We then saved the new dataframe with the updated values for the next set of tasks.
