**Below are the steps we go through to load, view, and visualize any CSV data.**  
  
**STEP 1: ADDING PACKAGES**  
  
**We import the Python packages we require.**  

In [2]:
import numpy  as np             # easy to play with arrays etc.
import pandas as pd             # required to load and read data and put in dataframe.
import matplotlib.pyplot as plt # required for data visualization purposes.
import seaborn as sns           # required for data visualization purposes.
import plotly.plotly as py      # required for data visualization purposes.
import plotly.graph_objs as go  # required for data visualization purposes.
from IPython.display import display, HTML
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

**STEP 2: READING IN A DATASET**  
  
**To read an Excel file (.xls or .xlsx), use pd.read_excel(); for a comma-separated file, use pd.read_csv().**  

**Both functions let you choose the header row and the index column; pd.read_csv() also lets you set the delimiter.**  

**The data comes from http://gsociology.icaap.org/dataupload.html**  
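As a minimal sketch of those options, the cell below reads a small delimited text with pd.read_csv(). The text, the column names, and the `sample` variable are made up for illustration and are not part of the dataset used in this notebook.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited text standing in for a file on disk.
raw_text = "country;births;deaths\nAfrica;100;40\nAsia;300;120\n"

# sep sets the delimiter, header picks the row that holds the column names,
# and index_col picks the column used as the row index.
sample = pd.read_csv(io.StringIO(raw_text), sep=";", header=0, index_col=0)
print(sample)
```

pd.read_excel() accepts the same `header` and `index_col` arguments, so the call looks much the same for a spreadsheet.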
    


**STEP 3: VIEWING THE FIRST FEW ROWS**  
  

**To see the first few rows of the data and make sure we read it in correctly, we use .head()**


In [3]:
#excel_file = "http://gsociology.icaap.org/data/UN_BirthDeathMigration.xlsx"
excel_file = "causes_of_death2.xlsx"
data = pd.read_excel(excel_file)
data.head()

Unnamed: 0,"Births, Deaths, Net Migration, 1950 to 2100",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 87,Unnamed: 88,Unnamed: 89,Unnamed: 90,Unnamed: 91,Unnamed: 92,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96
0,,b50-55,b55-60,b60-65,b65-70,b70-75,b75-80,b80-85,b85-90,b90-95,...,m50-55,m55-60,m60-65,m65-70,m70-75,m75-80,m80-85,m85-90,m90-95,m95-00
1,Country or Area,1950-1955,1955-1960,1960-1965,1965-1970,1970-1975,1975-1980,1980-1985,1985-1990,1990-1995,...,2050-2055,2055-2060,2060-2065,2065-2070,2070-2075,2075-2080,2080-2085,2085-2090,2090-2095,2095-2100
2,Africa,1.15858e+07,12920715,14460425,1.61173e+07,1.81833e+07,2.0536e+07,2.3008e+07,2.53157e+07,2.72664e+07,...,-344302,-293260,-250468,-212156,-175463,-138985,-104560,-70112.6,-36388,-3801.2
3,Asia,6.17152e+07,6.39118e+07,6.91948e+07,76666011,7.82249e+07,7.45254e+07,79797062,8.55579e+07,82732165,...,-893029,-842434,-769528,-682964,-587950,-494655,-392600,-292060,-192290,-92607
4,Australia/New Zealand,252077,279659,301107,299319,321580,281855,287261,302604,317138,...,50629.2,42928.6,36621.6,31028.4,25867.2,20755.4,15983,11188.2,6272.4,0


**STEP 4: GET BASIC INFORMATION**  
  
**To get basic info from the dataset, we use .info()**

In [None]:
data.info()

**STEP 5: SEE FURTHER DETAILS**  
  
**To get datatypes of each column, we can use .dtypes**  

**To get more details about each column, we can use .describe()**  
  
.describe() summarizes only numeric columns by default. We get just 3 columns here because the rest contain commas, so pandas reads them as text; the commas need to be removed.
We can deal with this later while cleaning.

In [None]:
print(data.dtypes)
print("\n")
display(HTML(data.describe().to_html()))
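To illustrate why columns with commas drop out of .describe(), here is a small sketch on a made-up frame. It assumes the comma stands for a decimal mark; the `toy` frame and its `rate` column are hypothetical.

```python
import pandas as pd

# Hypothetical frame: "rate" uses a comma as the decimal mark,
# so pandas stores it as text instead of numbers.
toy = pd.DataFrame({"year": [1999, 2000, 2001],
                    "rate": ["12,5", "13,0", "11,75"]})

print(toy.dtypes)                              # "rate" is not a numeric dtype
before = list(toy.describe().columns)          # describe() keeps numeric columns only
print(before)

# Swap the comma for a period, then convert the strings to floats.
toy["rate"] = pd.to_numeric(toy["rate"].str.replace(",", ".", regex=False))
after = list(toy.describe().columns)
print(after)
```

If the commas were thousands separators instead, replacing them with an empty string before the conversion would be the analogous step.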


**STEP 6: COUNT NUMBER OF EMPTY VALUES IN COLUMN**  
   
**We can check the number of null values a column has by using .isnull().sum()**  
  
**For example, here, Climate has the most null values.**  

In [None]:
print(data.isnull().sum())
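A small sketch of the same check on a made-up frame, sorted so the column with the most missing values comes first. The `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
toy = pd.DataFrame({"region": ["A", "B", None, "D"],
                    "climate": [1.0, np.nan, np.nan, 2.0],
                    "population": [10, 20, 30, 40]})

null_counts = toy.isnull().sum()                  # missing values per column
print(null_counts.sort_values(ascending=False))   # worst column first
print(null_counts.idxmax())                       # name of the column with the most nulls
```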

**STEP 7: SEE NUMBER OF UNIQUE VALUES IN COLUMN**  
  
**It is useful to see the number of unique values in each column using .nunique()**  
  
**Here we see that region and climate have enough unique values to group by, so we can group by these columns and make good visualizations.**  

In [None]:
print(data.nunique())
# mean age-adjusted death rate for each cause of death
group1 = data.groupby("Cause Name")['Age-adjusted Death Rate'].mean()
print(group1)
print("\n")
# first five rows of each year
group2 = data.groupby("Year")
display(HTML(group2.head().to_html()))
print("\n")
# first five rows of each state
group3 = data.groupby("State")
display(HTML(group3.head().to_html()))
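The following sketch shows .nunique() and a grouped mean on a tiny made-up frame. The column names mirror the ones used above, but the numbers are invented for illustration only.

```python
import pandas as pd

# Hypothetical rows; the values are invented, not taken from the dataset.
toy = pd.DataFrame({
    "Cause Name": ["Cancer", "Cancer", "Stroke", "Stroke"],
    "Year": [1999, 2000, 1999, 2000],
    "Age-adjusted Death Rate": [200.0, 190.0, 60.0, 50.0],
})

unique_counts = toy.nunique()   # distinct values per column
print(unique_counts)

# One mean per cause: the rows are split by "Cause Name" and averaged.
mean_rate = toy.groupby("Cause Name")["Age-adjusted Death Rate"].mean()
print(mean_rate)
```

A column with only a handful of unique values, like "Cause Name" here, tends to make a readable grouping key.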

**STEP 8: PLOT WHOLE DATASET**  
  
**Let us try to visualize all the data at once.**  
  
**After that, we count the number of countries in each region and plot it, and do the same for climate.**  

In [None]:
a = data.plot()
plt.show()

**STEP 9: NUMBER OF OCCURRENCES OF EACH VALUE IN COLUMN**  
  
**A good way to summarize a column you wish to group by is to use .value_counts()**  
  
**It gives a clear picture of how many rows would fall in each group.**  

In [None]:
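data['Cause_Name'] = data['Cause Name']   # copy without the space so data.Cause_Name works
print(data.Cause_Name.head())
region = data.Cause_Name.value_counts()   # number of rows per cause of death
print(region)
climate = data.Year.value_counts()        # number of rows per year
print(climate)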
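As a quick sketch of what .value_counts() returns, the cell below counts a short made-up series. The `causes` series is hypothetical.

```python
import pandas as pd

# Hypothetical column of categories.
causes = pd.Series(["Cancer", "Stroke", "Cancer", "Cancer", "Stroke", "Diabetes"],
                   name="Cause Name")

counts = causes.value_counts()                 # rows per value, most frequent first
shares = causes.value_counts(normalize=True)   # same counts as fractions of the total
print(counts)
print(shares)
```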

**STEP 10: VISUALIZATION**  
  
**Using matplotlib.pyplot to make bar charts**

In [None]:
plt.figure(figsize=(10,7))
positions = np.arange(len(group1.index))            # one bar per cause of death
plt.bar(positions, group1.values)
plt.xticks(positions, group1.index, rotation=90)    # label each bar with its cause
plt.ylabel('Average Age-adjusted Death Rate')
plt.xlabel('Cause Of Death')
plt.show()

**STEP 11: CLEANING DATA**  
  
**To see more from the data, it has to be cleaned. Cleaning data is usually unique to each dataset.**  

**In this instance, we can see that many of the columns have commas where periods should be. To change our data to a desirable format, we need to access the columns. Specifically, we need to remove spaces, brackets, symbols etc. For this dataset, that seems unnecessary.**

In [None]:
print(data.dtypes)
print("\n")
display(HTML(data.describe().to_html()))

**STEP 12: HEATMAP OF CORRELATION BETWEEN COLUMNS**  
  
**When two sets of data are strongly linked together, we say they have a high correlation. To see the correlation between all the columns, we use .corr()**

In [None]:
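f,ax = plt.subplots(figsize=(15, 13))
# correlation is only defined for numeric columns, so drop the text columns first
sns.heatmap(data.select_dtypes(include="number").corr(), annot=True, ax=ax)
plt.show()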
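To show what the matrix behind the heatmap looks like, here is a sketch on a made-up frame. The `toy` frame is hypothetical: `deaths` is built as exactly twice `births`, and `migration` falls as `births` rises.

```python
import pandas as pd

# Hypothetical frame with one text column and three numeric columns.
toy = pd.DataFrame({"births": [1.0, 2.0, 3.0, 4.0],
                    "deaths": [2.0, 4.0, 6.0, 8.0],
                    "migration": [4.0, 3.0, 2.0, 1.0],
                    "area": ["A", "B", "C", "D"]})

# Keep the numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr)
```

Values near 1 mean two columns rise together, values near -1 mean one rises as the other falls, and values near 0 mean little linear relationship.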