<h1><center> Data Visualisation Workbook </center></h1>
<h3>Hello everyone! This is a workbook designed to put the knowledge you've learnt about data cleaning and visualisation into practice :) </h3>

<h2> Downloading the data</h2>
<p>To start things off, we first need to download the dataset that you will be using and modifying. The dataset provided to you uses real-world data on the effect alcohol has on dementia risk around the world. Hopefully, by the end of this workbook, you'll be able to notice patterns and trends within the data and even think about why these trends might occur! <br><br><b> The link to the dataset is: </b> <a href="https://github.com/charlieblindsay/icsm/blob/master/blood_pressure.ipynb">here</a>
<br><br> This link should take you to the dataset/file we want you to be working with, which is part of the GitHub repository for ICSM Coding.<br> Download the ALCOHOL dataset onto your computer, and store it somewhere you can easily access. For example, you could make your own folder in your documents. </p>

<p>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</p>
<h2> Contents for Instructions </h2>

   <ol>
       <li>Installing relevant Python packages</li>
       <li>Importing relevant Python packages</li>
       <li>Opening the dataset</li>
       <li>Cleaning the dataset</li>
       <li>Transposing the dataset into Seaborn format</li>
       <li>Creating plots from the dataset</li>
   </ol>


<p>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</p>
<h2>1.0 Installing Relevant Python Packages </h2>

<p> Before starting to write any code in Python, it is very important to install the packages your code will rely on. Each of these packages has a set of tools that can perform functions that are not available in basic Python. <br><br> To work with datasets, the <b>most common </b> packages that need to be installed are:
    <ul>
        <li><b>NumPy </b>(to work with numbers)</li>
        <li><b>matplotlib </b>(to make basic graphs/ plots)</li>
        <li><b>pandas </b>(to manipulate datasets)</li>
        <li><b>seaborn </b>(to make more advanced graphs/ plots)</li>
    </ul>

<br>
<b>In the code cell below, can you please install the above 4 python packages we will be using:</b> </p>

In [1]:
import sys
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install seaborn



<p>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</p>
<h2> 2.0 Importing relevant Python packages </h2>

<p> Now that we've successfully installed the packages, we need to import them into this workbook. This allows us to access these packages and actually use them in our code. Every time you make a separate Python file, you should import the relevant packages.
    
<br><b> In the cell below, can you import: </b>
    <ul>
        <li>numpy as np</li>
        <li>pandas as pd</li>
        <li>matplotlib.pyplot as plt</li>
        <li>seaborn as sns</li>
    </ul>

</p>

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

<p>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</p>
<h2>3.0 Opening the dataset</h2>
<p>Now that Python is ready to use and we've downloaded the alcohol dataset onto our computer, we need to open and read that dataset in Python. For the approach used here, the dataset HAS to be in a <b>comma separated value (csv)</b> format.<br><br>Before opening the dataset you need to know what the current working directory is, and change it to the directory in which you have saved the alcohol dataset csv file. The current working directory is basically the 'path', or the area of your computer, that your Jupyter notebook will search in. So it is important to specify which directory you want Jupyter to look in to find the file you need to open.<br><br> A working directory should be in the format of: C:\Users\your_name\ ....<br><br><b>There are multiple ways to open a dataset in Python; however, we would like you to focus on using pandas for now. There are 2 coding cells below. <br><br>In cell 1 can you:
<ol>
    <li>import os</li>
    <li>change the working directory to the directory you have saved the alcohol dataset in</li>
</ol>
<br>
In cell 2 can you:
    <ol>
        <li>use pandas to read the csv</li>
        <li>open the csv</li>
    </ol>
</b></p>
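<p>If you would like a hint, the two cells above could be sketched roughly as follows. This is only an illustration under some assumptions: the folder is a temporary stand-in for wherever you saved your file, and "example.csv" is a made-up file name with made-up values, not the real alcohol dataset.</p>

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the folder where you saved the dataset
data_folder = tempfile.mkdtemp()

# Write a tiny made-up csv so this sketch runs on its own;
# with the real dataset you would skip this step
with open(os.path.join(data_folder, "example.csv"), "w") as f:
    f.write("Country,2000,2001\nFrance,1.2,1.3\n")

print(os.getcwd())      # where Jupyter is currently looking
os.chdir(data_folder)   # point Jupyter at the folder holding the csv

toy = pd.read_csv("example.csv")  # read the csv into a dataframe
print(toy)
```

<p>With your own file, you would replace the temporary folder with the path to your folder and "example.csv" with the name of the file you downloaded.</p>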

<p>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</p>
<h2>4.0 Cleaning the dataset</h2>

<h3>Renaming the columns</h3>
<p>The raw dataset looks a bit messy, and is definitely not in a state we can easily work with. We have column headings containing 'unnamed', which don't add any useful information, and we don't need long sentences for column headings when we're trying to manipulate the data. <br> To clean it up, we first want to replace the top heading row with row 0, making row 0 the new set of headings. This will clearly set the data out to show values for each country in each year between 2000 and 2009. <br><br><b> In the cell below can you set the dataframe's column headings equal to row 0?</b> Make sure to display the dataframe again afterwards.</p>
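<p>As a hint, here is one possible way to do this on a small made-up dataframe (the values and the 'Beverage Types' heading are assumptions for illustration, not the real dataset):</p>

```python
import pandas as pd

# Made-up dataframe whose real headings are sitting in row 0
toy = pd.DataFrame(
    [["Country", "Beverage Types", 2000.0],
     ["France", "Beer", 1.2],
     ["Spain", "Wine", 1.5]],
    columns=["Unnamed: 0", "Unnamed: 1", "A very long heading"],
)

# Copy the values in row 0 into the column headings
toy.columns = toy.iloc[0]
print(toy)
```

<p>Notice that row 0 is still in the table afterwards; it has only been copied into the headings.</p>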

<p>Now that we have copied row 0 into the column headings, we need to remove row 0 from the dataset, so that the current 'row 1' becomes the first row below the headings.</p><br>
<h4>Can you remove row 0 from the dataset?</h4>
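<p>One possible approach, shown on a small made-up dataframe (an illustration only), is to drop the row by its index label and then renumber the rows:</p>

```python
import pandas as pd

# Made-up dataframe where row 0 just repeats the headings
toy = pd.DataFrame({
    "Country": ["Country", "France", "Spain"],
    "2000": ["2000", 1.2, 1.5],
})

# Remove the row labelled 0, then renumber the remaining rows from 0
toy = toy.drop(index=0).reset_index(drop=True)
print(toy)
```

<p>Renumbering with reset_index is optional; without it the first remaining row would simply keep the label 1.</p>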

<p> For some reason, the years in the column headings have an extra decimal point which is <i>slightly</i> unnecessary.</p> <br>
<h4><b> Could you update the headers so that the years 2000 - 2009 do not contain the extra decimal point?</b></h4>
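<p>A minimal sketch of one way to do this, assuming the year headings are stored as decimal numbers such as 2000.0 (the helper name tidy_header and the values are made up for illustration):</p>

```python
import pandas as pd

# Made-up dataframe with year headings stored as decimals
toy = pd.DataFrame(
    [["France", 1.2, 1.3]],
    columns=["Country", 2000.0, 2001.0],
)

def tidy_header(name):
    # Turn a decimal year such as 2000.0 into the whole number 2000;
    # leave text headings such as 'Country' unchanged
    if isinstance(name, float):
        return int(name)
    return name

toy.columns = [tidy_header(c) for c in toy.columns]
print(list(toy.columns))
```

<p>If your headings turn out to be stored as text instead, this exact helper would leave them unchanged and a different conversion would be needed.</p>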

<p> We also have a 'data source' column next to the country column, which we don't need.</p>
<br><h4>In the cell below, can you remove the data source column, given that it's not needed?</h4>

In [None]:
# axis=1 tells pandas to drop a column rather than a row
df = df.drop(['Data Source'], axis=1)
df

<p>For some countries, some of the data values are written as 'NaN'. This is a shorthand for 'Not a Number', rather than a nanny! These values can't be interpreted, so ideally they need to be removed so that we are working with rows that have numbers only.<br></p>
<h4><b>In the cell below, could you update the dataset so all the rows that have NaN in them are removed?</b></h4>

In [None]:
# Keep only the rows that contain no NaN values
df = df.dropna()
df

<h4>Now that we've gotten a cleaned overall dataset, we can create a new dataframe from it that focuses on the rows and columns that we want to work with specifically.</h4>

<p>For example, let's say we only want to work with the data values for the beer beverage type and not the others. Or we only want to analyse the data for 'all beverages' rather than each of them separately. To make things clearer, we can cut down the dataset and make a new dataframe containing <b>only</b> the information that we need.<br><br><b>In the cell below can you write some code to create a new dataframe (call it df_beer) that contains the rows showing data on dementia risk when consuming beer for all countries, through all the years?</b></p>
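<p>As a hint, rows can be picked out by comparing a column against the value you want. The sketch below uses a small made-up dataframe; the heading 'Beverage Types' and the label 'Beer' are assumptions, so check how they are actually written in your dataset (including capital letters and any stray spaces).</p>

```python
import pandas as pd

# Made-up dataframe with two beverage types per country
toy = pd.DataFrame({
    "Country": ["France", "France", "Spain", "Spain"],
    "Beverage Types": ["Beer", "Wine", "Beer", "Wine"],
    "2000": [1.2, 3.4, 0.9, 2.8],
})

# Keep only the rows where the beverage type is 'Beer'
toy_beer = toy[toy["Beverage Types"] == "Beer"]
print(toy_beer)
```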

<p>We now have this condensed dataset, but the index (i.e. the first column) contains the row numbers, which we don't need. <br><br> <b>Can you write some code to make the 'Country' column the index/first column of the dataset instead?</b></p>
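<p>One possible way to do this, shown on a small made-up dataframe, is with the pandas set_index method:</p>

```python
import pandas as pd

# Made-up dataframe that still has row numbers as its index
toy_beer = pd.DataFrame({
    "Country": ["France", "Spain"],
    "2000": [1.2, 0.9],
    "2001": [1.3, 1.0],
})

# Use the 'Country' column as the index instead of the row numbers
toy_beer = toy_beer.set_index("Country")
print(toy_beer)
```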

<p>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</p>

<h2>5.0 Transposing the dataset into Seaborn format</h2>

<p> Now we have a nicely cleaned, condensed and organised dataset that we can analyse. Seaborn is a Python data visualisation library that allows you to produce really pretty, colourful statistical graphics (graphs, charts and other cool complex diagrams). <br><br>There's an <a href= 'https://seaborn.pydata.org/examples/index.html'>example gallery</a> on the Seaborn website if you want to check out some graphs you could make! However, for Seaborn to do its magic, we need to transpose the dataset (swap its rows and columns) so that it is laid out the way Seaborn expects.<br><br> Luckily for us, it's a very simple, short piece of code! <br><br> <b> In the cell below can you transpose the dataset (df_beer) to make it compatible with Seaborn? </b></p>
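<p>To see what transposing does, here is a small sketch on made-up values. The .T attribute of a pandas dataframe swaps rows and columns, so the countries become the columns and the years become the index.</p>

```python
import pandas as pd

# Made-up dataframe: one row per country, one column per year
toy_beer = pd.DataFrame(
    {"2000": [1.2, 0.9], "2001": [1.3, 1.0]},
    index=pd.Index(["France", "Spain"], name="Country"),
)

# Swap rows and columns: years become rows, countries become columns
toy_wide = toy_beer.T
print(toy_wide)
```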

<p>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</p>
<h2>6.0 Creating plots from the dataset</h2>
<p>Now that we have our table in Seaborn format, we can get started with making some graphs with Seaborn!<br>Let's start off with making a line graph.<br><br> Alcohol consumption varies a lot between countries (mainly for cultural and societal reasons), and is one of the potential risk factors for developing dementia. <br><br><b>Can you create a line plot showing how the effect of alcohol consumption on dementia risk changes in 4 countries of your choice within the dataset (df_beer), between 2000 and 2009?<br>If you are feeling confident, could you add a graph title, axis titles and a key (legend)?</b></p>

<p>We can also make bar charts from this data: <br><br><b>Can you make a bar chart showing the risk of dementia for each alcoholic beverage type (excluding 'all types') in 2008 and 2009 (both in the same graph)?</b></p>
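<p>One possible approach, sketched on made-up values, is to reshape the table so that each row holds one beverage type, one year and one value, and then let Seaborn colour the bars by year. The heading 'Beverage Types' and the numbers are assumptions for illustration.</p>

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up table: one row per beverage type, one column per year
toy = pd.DataFrame({
    "Beverage Types": ["Beer", "Wine", "Spirits"],
    "2008": [1.0, 2.0, 0.5],
    "2009": [1.1, 1.9, 0.6],
})

# Reshape so each row is one (beverage type, year, value) combination
toy_long = toy.melt(id_vars="Beverage Types", var_name="Year", value_name="Value")

# hue="Year" puts the 2008 and 2009 bars side by side for each beverage type
ax = sns.barplot(data=toy_long, x="Beverage Types", y="Value", hue="Year")
ax.set_title("Made-up values by beverage type")
plt.show()
```

<p>If your table has several countries for each beverage type, Seaborn's barplot will by default show the average across them, so you may want to pick one country or decide how to combine them first.</p>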

<h2>..and that's it for this workbook!</h2>
<h4> I hope you have enjoyed going through this section of the python course - being able to analyse data through code can really broaden your horizons when it comes to data visualisation, and can really help with making your research data (if you are to undertake research) much more appealing! </h4><br>
    <p>Video solutions to this workbook will be available on the website under section 7.</p>