
# Data in files

The most common source of data is files.  One popular file format is CSV (comma-separated values).  CSV files store data in table form, with all the data items stored as text and separated by commas.  Data is organised with one record per row.
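
To make the layout concrete, here is a minimal sketch that parses a tiny piece of CSV text with Python's built-in `csv` module. The text, column names and values are made up for illustration; they are not taken from the worksheet's data file.

```python
import csv
import io

# Hypothetical CSV text: a header row, then one record per row.
csv_text = """area,median_salary,population_size
alpha,30000,1000
beta,35000,2500
"""

# csv.reader splits each row at the commas and gives back a list of strings.
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0]    # the column names
records = rows[1:]  # the data rows

print(header)
print(records[0])
# Every item is text, even the ones that look like numbers.
print(type(records[0][1]))
```

Because every item arrives as text, a tool such as pandas has to work out which columns are numbers when it reads the file.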

CSV is a simple plain-text format with very little overhead, which helps to keep storage requirements low and reduces the time taken to transfer large files from one computer to another.

In this worksheet you will be given the code to read a table of data from a CSV file and will be given the column names.  Each column will hold a set of data of the same type (number, text, date).

You will be able to make lists following the example, and then use Python to get information (such as length, max, min, sum, average) or to print the list, or particular parts of the list.

---
[Video](https://vimeo.com/996732163/eb811ace14?share=copy)

---
### Read the file

This code will read the file and create a table from which you can create lists, following the example.  

It uses a library called **pandas** which is built to work with large data files.  It is common to refer to pandas as pd to keep the amount of typing short.  

**Run the code to see what the data set looks like**.  As long as you have run this code cell, the data table will always be available lower down in the notebook and will always be called `dataset`.

In [None]:
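import pandas as pd

def get_dataset():
  # address of the CSV file to read
  url = "https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/housing_in_london_yearly_variables.csv"
  # read the CSV file into a pandas table (a DataFrame) and return it
  return pd.read_csv(url)

# store the table so that later cells can use it, then show it
dataset = get_dataset()
display(dataset)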

---
### Get the list of areas

To create a list from a column, use the code below.  This code gets the `area` column; to get the `median_salary` or the `population_size` column instead, replace `area` with an exact copy of that column's header.  


In [None]:
areas = list(dataset['area'])
for area in areas:
  print(area)
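
The same pattern can be tried on a small table built by hand. This is only a sketch: `toy_dataset` and its values are made up for illustration and stand in for the real `dataset`.

```python
import pandas as pd

# Hypothetical stand-in for `dataset`, with made-up values.
toy_dataset = pd.DataFrame({
    "area": ["alpha", "beta", "gamma"],
    "median_salary": [30000.0, 35000.0, 28000.0],
})

# The column headers, exactly as they must be typed inside the brackets.
print(list(toy_dataset.columns))

# Same pattern as for the areas, but with a different column header.
toy_salaries = list(toy_dataset["median_salary"])
for salary in toy_salaries:
    print(salary)
```

If the header is misspelled or uses different capital letters, pandas raises a `KeyError`, which is why the header needs to be an exact copy.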

---
### Exercise 1 - sort the areas list into alphabetical order

Write a function that will:  
*  Use the `.sort()` method to sort the areas into alphabetical order.  
*  Use a for loop to print the sorted list
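
As a reminder of how `.sort()` behaves, here is a minimal sketch on a made-up list (`fruits` is a hypothetical name, not part of the dataset).

```python
# Hypothetical list, not taken from the dataset.
fruits = ["pear", "apple", "cherry"]

# .sort() rearranges the list itself and returns None,
# so call it on its own line rather than assigning its result.
fruits.sort()

for fruit in fruits:
    print(fruit)
```

After the call, the original order is gone: `fruits` itself is now in alphabetical order.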

In [None]:
def sort_areas():


sort_areas()

---
### Exercise 2 - create another list

Create a new list called **median_salaries**.  Print the `median_salaries` list, one item per line.  


---
### Exercise 3 - print statistics about the median salaries

Write a function that will:  

*  print the number of salaries in the list
*  create a variable called **largest**, assign it the largest salary in the list
*  print the value of `largest`
*  create a variable called **smallest**, assign it the smallest salary in the list
*  print the value of `smallest`
*  calculate and print the difference between `largest` and `smallest`
*  sort the `median_salaries` list into ascending order
*  calculate the `index` (or position) of the value in the middle of the list (as an integer)
*  print the item at that position in the list

**Expected output**
1071
61636.0
15684.0
45952.0
32681.0
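
The built-in functions needed for these steps can be tried out on a short made-up list first. This is a sketch under the assumption that integer division (`//`) is an acceptable way to find the middle position; the values are hypothetical.

```python
# Hypothetical salaries, not taken from the dataset.
toy_salaries = [31000.0, 27000.0, 45000.0, 29000.0, 38000.0]

count = len(toy_salaries)        # number of items in the list
largest = max(toy_salaries)      # biggest value
smallest = min(toy_salaries)     # smallest value
spread = largest - smallest      # difference between the two

# Sort into ascending order, then find the middle position.
toy_salaries.sort()
middle_index = len(toy_salaries) // 2   # // always gives an integer
middle_value = toy_salaries[middle_index]

print(count, largest, smallest, spread, middle_value)
```

Using `//` rather than `/` matters here, because a list index has to be an integer and `/` always produces a float.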

In [None]:
def create_salary_stats():


create_salary_stats()

---
### Exercise 4 - create a population list

Create a new list called **population_sizes**.  Print the `population_sizes` list, one item per line.  

---
### Exercise 5 - print some statistics about population sizes

Write a function that will:

*  print the number of items in the `population_sizes` list
*  create a variable called **largest_population**, assign it the largest population size in the list
*  print the value of `largest_population`
*  create a variable called **smallest_population**, assign it the smallest population size in the list
*  print the value of `smallest_population`
*  calculate and print the difference between largest and smallest population
*  create a variable called **total** to hold the sum of the population_sizes list
*  calculate and print the average population per area
*  print the total

**Expected output**
1071
66435550.0
6581.0
66428969.0
nan

**Note:**  The last output, `nan`, stands for "not a number".  You may have noticed when you printed the list that some of the values were `nan`.  These represent missing data.  Adding `nan` to any number gives `nan`, so the sum (and therefore the average) comes out as `nan`.  You will learn how to deal with this later on in the course.
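
The effect of a single missing value can be seen in a small sketch. The list below is made up, with `float("nan")` standing in for a missing value.

```python
import math

# Hypothetical population sizes with one missing value.
toy_sizes = [1000.0, float("nan"), 2500.0]

# Any addition that involves nan gives nan, so the whole sum is nan.
total = sum(toy_sizes)
average = total / len(toy_sizes)

print(total)
print(average)

# math.isnan() is a reliable way to check whether a value is nan.
print(math.isnan(total))
```

One missing value is enough to turn the total, and anything calculated from it, into `nan`.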

In [None]:
def create_population_stats():


create_population_stats()

---
### CHALLENGE (optional)

From the exercises above, you know the largest and smallest population_size and you know the largest and smallest median_salary.

Write a function that will:
*  get new copies of the three lists (`areas`, `median_salaries`, `population_sizes`), using the same code as before
*  use the `.index()` method to get the `index` of the largest `median_salary`
*  print the `area` that is at this `index` in the `areas` list
*  use the `.index()` method to get the `index` of the smallest `population_size`
*  print the `area` that is at this `index` in the `areas` list

Are the areas the same?  

**Question**:  why wouldn't it be appropriate, with this dataset, to check whether the area with the largest `population_size` also has the lowest `median_salary`?  If you are not sure, amend the code below so that it finds the areas with the largest population and the smallest median salary.
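
The `.index()` steps in the list above can be sketched on two short made-up lists. The names and values here are hypothetical; the point is only that the same position refers to the same row in both lists.

```python
# Hypothetical lists, in matching row order.
toy_areas = ["alpha", "beta", "gamma"]
toy_salaries = [30000.0, 35000.0, 28000.0]

# Find the largest salary, then the position where it first appears.
largest_salary = max(toy_salaries)
position = toy_salaries.index(largest_salary)

# The same position in the other list gives the matching area.
print(toy_areas[position])
```

This only works while both lists are still in their original row order, which is why the challenge asks for new copies of the lists rather than the ones that were sorted in earlier exercises.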

In [None]:
def find_areas():


find_areas()

city of london
city of london


---
# Takeaways

*  we can use the pandas library to read data from an online CSV file and store it in a table
*  the table will have columns with headings and we can convert each column into a list
*  sometimes, data is incomplete and some statistics can't be calculated without cleaning up the data

---
# Your thoughts on what you have learnt  

Please add some comments in the box below to reflect on what you have learnt through completing this worksheet, and any problems you encountered while doing so.