# Introduction to Python Day 1
## Workbook Contents <a id = 'cont'></a>
1. [Example - Writing good code](#1)
2. [Example - Basic syntax](#2)
3. [Example - Script structure](#3)
4. [Example - The for loop](#4)
5. [Exercise - My first script](#5)
6. [Example - Built-in functions](#6)
7. [Example - Indexing and slicing](#7)
8. [Example - Iterators](#8)
9. [Example - String manipulation](#9)
10. [Exercise - Fixing a list of strings](#10)
11. [Exercise - Importing data with pandas](#11)
12. [Example - Importing and exporting data with pandas](#12)
13. [Exercise - User defined functions](#13)
14. [Example - pandas: addressing, indexing, and masks](#14)
15. [Example - pandas: iterating and applying](#15)
16. [Example - pandas: grouping and aggregating](#16)
17. [Example - pandas: concatenating DataFrames](#17)
18. [Exercise - collecting and analysing separate data sources](#18)



## 1. Example - Writing good code<a id = '1'></a>
[Back to contents](#cont)

<b>Bad example:</b>

Note that this might be ok - if you're just trying to do something quickly and plan on throwing away the code.

In [1]:
x = 3.141*1.5**2
y = 2*3.14*1.5
z = 2*1.5
x2 = 4*3.14159265*1.5**3/3
print(x,y,z,x2)

7.06725 9.42 3.0 14.137166925


<b>Good example:</b>

Takes slightly longer to write - but is reusable and could be picked up by someone else.<br>
What could be done even better?

In [2]:
# radius in metres
r = 1.5
# value of pi
PI = 3.14159265

# ** is to the power of
# area of circle according to A = pi*r^2
A = PI*r**2
# circumference of circle according to c = 2*pi*r
c = 2*PI*r
# diameter of circle according to d = 2r
d = 2*r
# volume of sphere according to V = 4*pi*r^3/3
V = 4*PI*r**3/3

print('Area is ',A)
print('Circumference is ',c)
print('Diameter is ',d)
print('Volume is ',V)

Area is  7.0685834625
Circumference is  9.424777950000001
Diameter is  3.0
Volume is  14.137166925


## 2. Example - Basic syntax<a id = '2'></a>
[Back to contents](#cont)

<b>Notice the key features:</b>
- different types of comment
- string variables vs integers / numeric
- use of built-in functions (e.g. <code>print</code>) and operators (e.g. <code>=</code>)

N.B. the result of adding strings is different than adding integers.

In [2]:
# define a string variable
a_string = '1'
# add two strings together and print them
print(a_string + a_string)

11


In [3]:
'''define an integer variable
add two integers together
and print them'''

an_int = 1
print(an_int + an_int)

2
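The difference can be checked side by side. This is a minimal sketch (the names <code>joined</code>, <code>summed</code> and <code>converted</code> are illustrative, not part of the workbook): <code>+</code> joins strings end to end but adds numbers, and <code>int()</code> can convert a numeric string before adding.

```python
# + joins strings end to end, but adds numbers arithmetically
a_string = '1'
an_int = 1

joined = a_string + a_string        # string concatenation
summed = an_int + an_int            # integer addition
converted = int(a_string) + an_int  # convert the string first, then add

print(joined, summed, converted)
```

Mixing the two types directly (<code>a_string + an_int</code>) raises a <code>TypeError</code>, which is why the conversion step is needed.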


## 3. Example - Script Structure <a id = '3'></a>
[Back to contents](#cont)

<b>Read the below and feel free to modify.</b> <br>
N.B. the below has additional comments to explain - normally you wouldn't use this many.

In [6]:
# -*- coding: utf-8 -*-
"""
Created on Mon Sep  7 16:29:43 2020

@author: jasonboyle
"""
# script header above - not required but often auto-generated, notice triple quotes

# import modules required
import math

# define any user-defined functions here:

# def keyword introduces function
def my_function(): # function arguments are enclosed in brackets (no arguments here)
    # notice colon above, and notice that the line below the colon is indented
    # this function just prints a pre-defined string
    print('This is a function!')

# define another function that takes an integer argument    
def pow_2(an_int: int):
    # an_int is defined with local scope
    # do a simple calculation squaring the integer
    result_int = math.pow(an_int,2) # note that math.pow takes two arguments (base and exponent) separated by ,
    # note that an_int**2 gives the same value, but as an int (36) rather than a float (36.0)
    
    # return the result of the calculation
    return result_int


# call the basic function to print our message
my_function()


# define the value of an integer - play around with this value to see what happens
my_int = 6 

# do a logical test
if my_int >= 5: #>= is greater than or equal to
    function_result = pow_2(my_int)
    print('Function result is: ', function_result) #notice print can take multiple arguments separated by ,
    
else:
    print('Integer is less than 5 - function not called')
    



This is a function!
Function result is:  36.0


The script is repeated here with minimal comments - it may be easier to see what's happening

In [7]:
# -*- coding: utf-8 -*-
"""
Created on Mon Sep  7 16:29:43 2020

@author: jasonboyle
"""

# import required modules
import math

# define functions
def my_function():
    print('This is a function!')

def pow_2(an_int: int):
    result_int = math.pow(an_int,2)
    return result_int

# call function and set int variable
my_function()
my_int = 6 

# do logical test for my_int being at least 5 - calling pow_2 if successful
if my_int >= 5:
    function_result = pow_2(my_int)
    print('Function result is: ', function_result)
    
else:
    print('Integer is less than 5 - function not called')
    



This is a function!
Function result is:  36.0
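The result prints as <code>36.0</code> rather than <code>36</code> because <code>math.pow</code> always returns a float. A short sketch comparing it with the <code>**</code> operator (the variable names here are illustrative):

```python
import math

my_int = 6

float_result = math.pow(my_int, 2)  # math.pow always returns a float
int_result = my_int ** 2            # ** keeps an integer result as an integer

print(float_result, type(float_result).__name__)
print(int_result, type(int_result).__name__)
```

Either form is fine here; <code>**</code> is usually the simpler choice when the result should stay an integer.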


## 4. Example - the <code>for</code> loop <a id = '4'></a>
[Back to contents](#cont)

Using <code>range()</code>

In [8]:
for i in range(10):
    print(i, 'Hello world!')

0 Hello world!
1 Hello world!
2 Hello world!
3 Hello world!
4 Hello world!
5 Hello world!
6 Hello world!
7 Hello world!
8 Hello world!
9 Hello world!


<b>NB</b> - indexing in Python starts from 0. The first element is indexed as 0.

Iterating with a <code>for</code> loop

In [9]:
a_list = ['Sarah','Aqib','Jane','Tom']

for j in a_list:
    print(j)

Sarah
Aqib
Jane
Tom


## 5. Exercise - My First Script <a id ='5'></a>
[Back to contents](#cont)


<b>Write a script which does the following:</b>
- Stores 2 variables (one numeric, one string)
- Tests two logical conditions on the numeric variable
- Prints the string if logical conditions are met

Remember to use <code>if condition is true:</code>… and that <code>:</code> should be followed by an indent on the next line

In [1]:
# Enter your script here




## 6. Example - key built-in functions<a id= '6'></a>
[Back to contents](#cont)

<code>print()</code>  - print to screen either a direct value or the value of a variable

In [13]:
print('Hello World!')
my_var = 'a_long_string'
print(my_var)

Hello World!
a_long_string


String formatting with <code>%</code> - this acts like a placeholder with optional format strings

In [17]:
an_int = 100
a_float = 445.68868

print('Here is an integer %i' %an_int)
print('Here is a float %f' %a_float)
print('Here are two presentations of the same float %4.3f   %3.0f' %(a_float,a_float))

Here is an integer 100
Here is a float 445.688680
Here are two presentations of the same float 445.689   446
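The same output can also be produced with f-strings, which are available from Python 3.6 onwards. This is an alternative sketch rather than a replacement for the <code>%</code> style above; the variable names holding the text are illustrative.

```python
an_int = 100
a_float = 445.68868

# the format specification goes after the colon inside the braces
int_text = f'Here is an integer {an_int:d}'
float_text = f'Here is a float {a_float:f}'
two_text = f'Here are two presentations of the same float {a_float:.3f}   {a_float:.0f}'

print(int_text)
print(float_text)
print(two_text)
```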


<code>max()</code> <code>min()</code>  - max / min of multiple values

In [18]:
max([0,3,5,2])

5

In [19]:
min([0,3,5,2])

0

<code>len()</code> - length of an object

In [20]:
len(['this','is','a','list','of','strings'])

6

In [21]:
len('arbitrary_string')

16

<code>str()</code> <code>int()</code> <code>float()</code> etc.  - convert data types

In [22]:
str(1)

'1'

In [23]:
int('1')


1

In [24]:
float(1)

1.0

## 7. Example - Indexing and slicing<a id = '7'></a>
[Back to contents](#cont)

Create a <code>list</code>

In [25]:
a = ['zero','one','two','three','four','five','six','seven','eight','nine','ten']

Zero-based indexing

In [26]:
a[3]

'three'

Negative indices are an offset from the end of the sequence.

In [27]:
a[-3]

'eight'

This is known as a slice. Note that the lower limit is inclusive, while the upper limit is exclusive.

In [28]:
a[3:6]

['three', 'four', 'five']

Yes, it's empty: you reached the end before you started!

In [29]:
a[6:3]

[]

Leaving out a limit means "go as far as you can in that direction".

In [30]:
a[:6]

['zero', 'one', 'two', 'three', 'four', 'five']

In [31]:
a[6:]

['six', 'seven', 'eight', 'nine', 'ten']

A third number gives the step size. It can be negative. 

In [32]:
a[0:6:2]

['zero', 'two', 'four']
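A negative step walks backwards through the list, so the start index should be the larger one. A small sketch using the same list (the names <code>backwards</code> and <code>reversed_all</code> are illustrative):

```python
a = ['zero','one','two','three','four','five','six','seven','eight','nine','ten']

# start at index 8, stop before index 2, moving back 2 places each time
backwards = a[8:2:-2]

# leaving out both limits with a step of -1 reverses the whole list
reversed_all = a[::-1]

print(backwards)
print(reversed_all[0], reversed_all[-1])
```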

## 8. Example - Iterators<a id = '8'></a>
[Back to contents](#cont)

Using the list from before:

In [38]:
a = ['zero','one','two','three','four','five','six','seven','eight','nine','ten']

Using a manually constructed <code>for</code> loop: the start and end points are wrong, so we miss 'zero' and then get an error when we try to access an element that doesn't exist.

In [39]:
for i in range(1,12):
    print(a[i])

one
two
three
four
five
six
seven
eight
nine
ten


IndexError: list index out of range
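One way to correct the loop above is to let <code>len()</code> decide the end point. This is a sketch of the index-based fix only (the <code>visited</code> list is added purely to record what was printed); iterating over the list directly is usually simpler.

```python
a = ['zero','one','two','three','four','five','six','seven','eight','nine','ten']

visited = []
# range(len(a)) gives the indices 0 up to len(a) - 1, so every element is reached
for i in range(len(a)):
    visited.append(a[i])
    print(a[i])
```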

Let's add a new value to the list

In [40]:
a.append('eleven')
a

['zero',
 'one',
 'two',
 'three',
 'four',
 'five',
 'six',
 'seven',
 'eight',
 'nine',
 'ten',
 'eleven']

Using <code>for</code>

In [41]:
for i in a:
    print(i)

zero
one
two
three
four
five
six
seven
eight
nine
ten
eleven


Using <code>enumerate</code>

In [42]:
for i, val in enumerate(a):
    print(i, val)

0 zero
1 one
2 two
3 three
4 four
5 five
6 six
7 seven
8 eight
9 nine
10 ten
11 eleven


With a slice to print the final 3 values from high to low

In [43]:
for i, val in enumerate(a[-1:-4:-1]):
    print(i, val)

0 eleven
1 ten
2 nine
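<code>enumerate</code> also accepts a <code>start</code> argument when the count should not begin at 0. A brief sketch with a shortened list (the <code>numbered</code> list is illustrative and only collects the pairs):

```python
a = ['zero', 'one', 'two']

numbered = []
# start=1 changes the counter only, not which elements are visited
for position, val in enumerate(a, start=1):
    numbered.append((position, val))
    print(position, val)
```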


## 9. Example - string manipulation<a id = '9'></a>
[Back to contents](#cont)

Define a string (datatype <code>str</code>)

In [44]:
my_string = 'WBS-1533271'

Slicing works with strings too, as strings are sequences of characters

In [45]:
my_string[4:8]

'1533'

Chain slices to apply one slice to the result of another

In [46]:
my_string[:6][-2:]

'15'

Flip the string (<code>::</code> indicates the whole string, <code>-1</code> indicates in reverse)

In [47]:
my_string[::-1]

'1723351-SBW'

Split the string based on a particular character using <code>.split()</code> - returns a <code>list</code>

In [48]:
my_string.split('-')

['WBS', '1533271']

Count occurrences of a particular character using <code>.count()</code>

In [49]:
my_string.count('3')

2

More powerful string searching with <code>regex</code>:

In [2]:
import re

string1 = 'The rain in Spain'
string2 = 'The dog in Spain'
string3 = 'The cat in spain'

re.search('^The.*Spain$', string1) 


<re.Match object; span=(0, 17), match='The rain in Spain'>

In [3]:
re.search('^The.*Spain$', string2) 

<re.Match object; span=(0, 16), match='The dog in Spain'>

In [5]:
re.search('^The.*Spain$', string3) 
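
The last search shows no output because <code>re.search</code> returns <code>None</code> when nothing matches: the pattern is case-sensitive and <code>string3</code> ends in a lower-case 'spain'. A short sketch of testing for that, and of relaxing the match with the <code>re.IGNORECASE</code> flag (the names <code>strict</code> and <code>relaxed</code> are illustrative):

```python
import re

string3 = 'The cat in spain'
pattern = '^The.*Spain$'

strict = re.search(pattern, string3)                  # case-sensitive search
relaxed = re.search(pattern, string3, re.IGNORECASE)  # ignore upper/lower case

if strict is None:
    print('No case-sensitive match')
if relaxed is not None:
    print('Case-insensitive match:', relaxed.group())
```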

## 10. Exercise - Fixing a list of strings<a id = '10'></a>
[Back to contents](#cont)

Each string in the <code>list</code> below has the following characteristics:
- Each string has a person's 'spirit animal' and their full name
- Animals are separated by '_' from the name
- Names are in order first then last, separated by a space
- Animals have accidentally been entered in reverse order
- There are some blank entries which we're not interested in

Example: John Smith's spirit animal is a dog - the string would be 'god_John Smith'

<b>Requirement</b>
1. Split the strings into three separate lists: first name, last name, spirit animal
2. The animal list should contain animals the right way round
3. Some strings in the list are blank - get rid of these, they shouldn't show up in the final lists
4. BONUS - get a list of animals which are 'subspecies'; these have commas in the animal name, e.g. 'dog, labrador'
5. BONUS - create a new list of strings which recombines and expands on the separate lists in a useful way: e.g. 'John Smith's spirit animal is a dog'
6. BONUS - count the number of subspecies

<b>HINT</b>

- Think about how you can fix one string on its own, then how can you extend this to a list, creating multiple results lists.
- You will need to combine what we have learned about strings, iterators, indexing, and lists

In [2]:
string_list = ['retaw ,naageL_Cacilie Lenahan', '', 'ylzzirg ,raeB_Crichton Comelini',
'dekcen-ylloow ,krotS_Cosetta Micallef', 'nainosduh ,tiwdoG_Dolly Dyer',
'esuorg egas retaerG_Meaghan Abbett', 'enipucrop naciremA htroN_Banky Lukash',
'alaoK_Gian Yarranton', 'retto revir naciremA htroN_Ariel Brett', '',
'elgae ynwaT_Matthus Sellner', 'ooragnak decaf-kcalB_Ethel De Bernardi',
'eulb ,kcocaeP_Paulie Reddecliffe', 'esiotrot treseD_Bastien Unwin', '',
'gohegdeh nacirfA htuoS_Ambrosio Heamus', '', '', 'nrehtuos ,gniwpaL_Jobey Rosini',
'yraccep deppil-etihW_Libbi Jeske', 'yballaw eligA_Arabela MacManus',
'ekans recaR_Wilfrid Brownrigg', 'kcalb ,nawS_Delinda Broschke',
'dedaeh-wolley ,aracaraC_Robbie Kittman', 'llibnepo ,krotS_Gaspar Butchers']

# Write your solution here
string_list

['retaw ,naageL_Cacilie Lenahan',
 '',
 'ylzzirg ,raeB_Crichton Comelini',
 'dekcen-ylloow ,krotS_Cosetta Micallef',
 'nainosduh ,tiwdoG_Dolly Dyer',
 'esuorg egas retaerG_Meaghan Abbett',
 'enipucrop naciremA htroN_Banky Lukash',
 'alaoK_Gian Yarranton',
 'retto revir naciremA htroN_Ariel Brett',
 '',
 'elgae ynwaT_Matthus Sellner',
 'ooragnak decaf-kcalB_Ethel De Bernardi',
 'eulb ,kcocaeP_Paulie Reddecliffe',
 'esiotrot treseD_Bastien Unwin',
 '',
 'gohegdeh nacirfA htuoS_Ambrosio Heamus',
 '',
 '',
 'nrehtuos ,gniwpaL_Jobey Rosini',
 'yraccep deppil-etihW_Libbi Jeske',
 'yballaw eligA_Arabela MacManus',
 'ekans recaR_Wilfrid Brownrigg',
 'kcalb ,nawS_Delinda Broschke',
 'dedaeh-wolley ,aracaraC_Robbie Kittman',
 'llibnepo ,krotS_Gaspar Butchers']

## 11. Exercise - importing data with pandas<a id = '11'></a> 
[Back to contents](#cont)

<b>Requirement</b>
1. Read in <code>.csv</code> data from the following location <code>C:\Users\student\Desktop\Python Training\MOCK_DATA.csv</code>

<b>Key pandas functions for import and export:</b>
- <code>pd.read_csv(a_path)</code>
- <code>pd.read_excel(a_path)</code>
- <code>a_df.to_csv(a_path)</code> (a method of a DataFrame, here called <code>a_df</code>)
- <code>a_df.to_excel(a_path)</code> (also a DataFrame method)
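
To see how the read and write functions pair up without touching any files, here is a minimal sketch that uses an in-memory text buffer in place of a path. The two-row sample and the names <code>csv_text</code>, <code>small_df</code> and <code>buffer</code> are made up for illustration and are not the training data file.

```python
import io
import pandas as pd

# a tiny made-up CSV held in memory, standing in for a file on disk
csv_text = 'id,first_name\n1,Graham\n2,Karlis\n'

# pd.read_csv is a top-level pandas function and returns a DataFrame
small_df = pd.read_csv(io.StringIO(csv_text))

# to_csv is a method of the DataFrame itself
buffer = io.StringIO()
small_df.to_csv(buffer, index=False)
round_trip = buffer.getvalue()

print(small_df)
print(round_trip)
```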

In [53]:
import pandas as pd
# your code goes here

## 12. Example - importing and exporting data with pandas<a id = '12'></a> 
[Back to contents](#cont)

Continuing from the example DataFrame above - let's check the columns

In [146]:
my_df = pd.read_csv(data_path)
my_df.columns.values

array(['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address',
       'salary', 'date_of_birth'], dtype=object)


Let's do some data manipulation

Create a new column for full name

In [147]:
my_df['full_name'] = my_df['first_name'] + ' ' + my_df['last_name']
my_df

Unnamed: 0,id,first_name,last_name,email,gender,ip_address,salary,date_of_birth,full_name
0,1,Graham,Ivanenkov,givanenkov0@uiuc.edu,Male,12.58.104.199,102778,11/06/1969,Graham Ivanenkov
1,2,Karlis,Ballchin,kballchin1@alibaba.com,Male,128.63.193.159,70340,10/04/1969,Karlis Ballchin
2,3,Moria,Barber,mbarber2@oaic.gov.au,Female,175.165.231.76,28931,05/03/1950,Moria Barber
3,4,Leighton,Quinnette,lquinnette3@biblegateway.com,Male,21.218.156.217,20684,11/03/1985,Leighton Quinnette
4,5,Trenton,Farmiloe,tfarmiloe4@webmd.com,Male,223.133.70.216,106946,09/12/1951,Trenton Farmiloe
...,...,...,...,...,...,...,...,...,...
995,996,Innis,Lindell,ilindellrn@privacy.gov.au,Male,33.228.210.6,26023,06/02/1970,Innis Lindell
996,997,Edin,Gajewski,egajewskiro@privacy.gov.au,Female,7.180.8.157,18719,26/09/1971,Edin Gajewski
997,998,Seymour,Boobier,sboobierrp@noaa.gov,Male,105.196.207.20,22734,28/07/1999,Seymour Boobier
998,999,Gerick,Seyers,gseyersrq@unc.edu,Male,220.22.200.1,85659,10/11/1950,Gerick Seyers


Get rid of the IP address column using <code>drop</code> as it's surplus

In [148]:
my_df = my_df.drop('ip_address', axis = 1)
my_df

Unnamed: 0,id,first_name,last_name,email,gender,salary,date_of_birth,full_name
0,1,Graham,Ivanenkov,givanenkov0@uiuc.edu,Male,102778,11/06/1969,Graham Ivanenkov
1,2,Karlis,Ballchin,kballchin1@alibaba.com,Male,70340,10/04/1969,Karlis Ballchin
2,3,Moria,Barber,mbarber2@oaic.gov.au,Female,28931,05/03/1950,Moria Barber
3,4,Leighton,Quinnette,lquinnette3@biblegateway.com,Male,20684,11/03/1985,Leighton Quinnette
4,5,Trenton,Farmiloe,tfarmiloe4@webmd.com,Male,106946,09/12/1951,Trenton Farmiloe
...,...,...,...,...,...,...,...,...
995,996,Innis,Lindell,ilindellrn@privacy.gov.au,Male,26023,06/02/1970,Innis Lindell
996,997,Edin,Gajewski,egajewskiro@privacy.gov.au,Female,18719,26/09/1971,Edin Gajewski
997,998,Seymour,Boobier,sboobierrp@noaa.gov,Male,22734,28/07/1999,Seymour Boobier
998,999,Gerick,Seyers,gseyersrq@unc.edu,Male,85659,10/11/1950,Gerick Seyers


Let's order the records alphabetically using <code>sort_values</code>

In [149]:
my_df = my_df.sort_values('full_name', axis = 0, ascending = True)
my_df

Unnamed: 0,id,first_name,last_name,email,gender,salary,date_of_birth,full_name
554,555,Abbe,Aucott,aaucottfe@ft.com,Female,14325,29/01/1962,Abbe Aucott
477,478,Abbi,Southwood,asouthwoodd9@businessinsider.com,Female,27059,24/05/1983,Abbi Southwood
214,215,Abie,Saenz,asaenz5y@webnode.com,Male,44571,07/01/1974,Abie Saenz
752,753,Addie,Gregorace,agregoracekw@gnu.org,Male,28902,07/06/1987,Addie Gregorace
79,80,Adeline,Gaskell,agaskell27@freewebs.com,Female,57393,11/10/1962,Adeline Gaskell
...,...,...,...,...,...,...,...,...
584,585,Zachariah,De Bruyne,zdebruyneg8@linkedin.com,Male,45348,19/10/1999,Zachariah De Bruyne
620,621,Zak,Furnival,zfurnivalh8@acquirethisname.com,Male,65818,16/09/1980,Zak Furnival
476,477,Zak,Manchester,zmanchesterd8@gov.uk,Male,11353,26/07/1975,Zak Manchester
234,235,Zelig,Russan,zrussan6i@hhs.gov,Male,73397,05/04/1975,Zelig Russan


Also get rid of the old name columns as we no longer need these - this time we pass a <code>list</code> of multiple columns to <code>drop</code>

In [150]:
my_df = my_df.drop(['first_name','last_name'],axis = 1)
my_df

Get the columns using the <code>columns</code> accessor

In [151]:
old_columns = my_df.columns
old_columns

Index(['id', 'email', 'gender', 'salary', 'date_of_birth', 'full_name'], dtype='object')

This produces a <code>pandas</code> <code>index</code> which can be tricky to work with - let's convert to a <code>list</code> instead


In [152]:
old_columns = list(old_columns.values)
old_columns

['id', 'email', 'gender', 'salary', 'date_of_birth', 'full_name']

Now rearrange our <code>list</code> using slices to reorder it. Note that <code>old_columns[-1]</code> is a string, so wrapping it in <code>[]</code> turns it into a one-element list, which lets us add the two lists together using <code>+</code>

In [153]:
new_columns = [old_columns[-1]] + old_columns[:-1]
new_columns

['full_name', 'id', 'email', 'gender', 'salary', 'date_of_birth']

Now reorder the <code>DataFrame</code> columns

In [154]:
my_df = my_df[new_columns]
my_df

Unnamed: 0,full_name,id,email,gender,salary,date_of_birth
554,Abbe Aucott,555,aaucottfe@ft.com,Female,14325,29/01/1962
477,Abbi Southwood,478,asouthwoodd9@businessinsider.com,Female,27059,24/05/1983
214,Abie Saenz,215,asaenz5y@webnode.com,Male,44571,07/01/1974
752,Addie Gregorace,753,agregoracekw@gnu.org,Male,28902,07/06/1987
79,Adeline Gaskell,80,agaskell27@freewebs.com,Female,57393,11/10/1962
...,...,...,...,...,...,...
584,Zachariah De Bruyne,585,zdebruyneg8@linkedin.com,Male,45348,19/10/1999
620,Zak Furnival,621,zfurnivalh8@acquirethisname.com,Male,65818,16/09/1980
476,Zak Manchester,477,zmanchesterd8@gov.uk,Male,11353,26/07/1975
234,Zelig Russan,235,zrussan6i@hhs.gov,Male,73397,05/04/1975


If we'd used <code>my_df.columns = new_columns</code> we would have just renamed the existing columns without moving them - not what we wanted!

In [155]:
#e.g. put back to the way it was
test_df = my_df[old_columns]
# now rename
test_df.columns = new_columns
test_df

Unnamed: 0,full_name,id,email,gender,salary,date_of_birth
554,555,aaucottfe@ft.com,Female,14325,29/01/1962,Abbe Aucott
477,478,asouthwoodd9@businessinsider.com,Female,27059,24/05/1983,Abbi Southwood
214,215,asaenz5y@webnode.com,Male,44571,07/01/1974,Abie Saenz
752,753,agregoracekw@gnu.org,Male,28902,07/06/1987,Addie Gregorace
79,80,agaskell27@freewebs.com,Female,57393,11/10/1962,Adeline Gaskell
...,...,...,...,...,...,...
584,585,zdebruyneg8@linkedin.com,Male,45348,19/10/1999,Zachariah De Bruyne
620,621,zfurnivalh8@acquirethisname.com,Male,65818,16/09/1980,Zak Furnival
476,477,zmanchesterd8@gov.uk,Male,11353,26/07/1975,Zak Manchester
234,235,zrussan6i@hhs.gov,Male,73397,05/04/1975,Zelig Russan
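The difference between reordering and renaming is easier to see on a tiny DataFrame. This is a sketch with two rows borrowed from the table above; <code>demo_df</code>, <code>reordered</code> and <code>relabelled</code> are illustrative names.

```python
import pandas as pd

demo_df = pd.DataFrame({'id': [555, 478],
                        'full_name': ['Abbe Aucott', 'Abbi Southwood']})

# selecting with a list of names moves each column together with its data
reordered = demo_df[['full_name', 'id']]

# assigning to .columns only changes the labels: the data stays where it was
relabelled = demo_df.copy()
relabelled.columns = ['full_name', 'id']

print(reordered)
print(relabelled)
```

In <code>relabelled</code> the column called <code>full_name</code> now holds the id numbers, which is the mix-up shown in the output above.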


Let's filter the data for people with google email addresses only

In [156]:
# get the subset of data with google email addresses (boolean mask)
my_mask = my_df['email'].str.contains('google')

# filter the dataframe
google_df = my_df[my_mask]
google_df

Unnamed: 0,full_name,id,email,gender,salary,date_of_birth
534,Adrienne MacConnal,535,amacconnaleu@google.com.hk,Female,94610,18/09/1950
695,Alta Blackhurst,696,ablackhurstjb@google.com.au,Female,14482,06/10/1982
757,Ber Bunner,758,bbunnerl1@google.de,Male,58939,30/01/1993
888,Cam Scrowson,889,cscrowsonoo@google.it,Female,107027,03/05/1998
912,Cassandry Agge,913,caggepc@google.com.hk,Female,76660,10/05/1964
852,Constanta Antonomolii,853,cantonomoliino@google.com.br,Female,72382,20/10/1995
561,Douglass Putland,562,dputlandfl@google.es,Male,38509,28/02/1954
382,Elaine Dowbakin,383,edowbakinam@google.ca,Female,57259,29/06/1989
446,Eugenio Brazener,447,ebrazenerce@google.ca,Male,53693,17/04/1947
64,Gabriell Slinger,65,gslinger1s@google.com.hk,Female,59909,27/08/1953


We can do all of the above manipulation in a few lines if we know what we're doing...

In [157]:
my_df = pd.read_csv(data_path)
my_df['full_name'] = my_df['first_name'] + ' ' + my_df['last_name']
my_df = my_df.drop(['ip_address','first_name','last_name'], axis = 1)
cols = list(my_df.columns.values)
# reorder the columns first, then filter and sort
my_df = my_df[[cols[-1]] + cols[:-1]]
google_df = my_df[my_df['email'].str.contains('google')].sort_values(by='full_name', ascending = True)
google_df

Unnamed: 0,full_name,id,email,gender,salary,date_of_birth
534,Adrienne MacConnal,535,amacconnaleu@google.com.hk,Female,94610,18/09/1950
695,Alta Blackhurst,696,ablackhurstjb@google.com.au,Female,14482,06/10/1982
757,Ber Bunner,758,bbunnerl1@google.de,Male,58939,30/01/1993
888,Cam Scrowson,889,cscrowsonoo@google.it,Female,107027,03/05/1998
912,Cassandry Agge,913,caggepc@google.com.hk,Female,76660,10/05/1964
852,Constanta Antonomolii,853,cantonomoliino@google.com.br,Female,72382,20/10/1995
561,Douglass Putland,562,dputlandfl@google.es,Male,38509,28/02/1954
382,Elaine Dowbakin,383,edowbakinam@google.ca,Female,57259,29/06/1989
446,Eugenio Brazener,447,ebrazenerce@google.ca,Male,53693,17/04/1947
64,Gabriell Slinger,65,gslinger1s@google.com.hk,Female,59909,27/08/1953


Finally let's put the data in a new Excel file

First let's decide on a new file name and path

In [159]:
import os
# combine path with new filename
new_path = os.path.join(os.path.dirname(data_path),'MOCK_DATA_modified.xls')
new_path

'C:\\Users\\student\\desktop\\Python Training\\MOCK_DATA_modified.xls'

Or more generically for an arbitrary filename and path we could just append <code>_modified.xls</code> to whatever the source data file was called

In [161]:
head, tail = os.path.split(data_path)
new_path = os.path.join(head, os.path.splitext(tail)[0] + '_modified.xls')  
new_path

'C:\\Users\\student\\desktop\\Python Training\\MOCK_DATA_modified.xls'
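The standard library's <code>pathlib</code> module offers an alternative to <code>os.path</code> for the same job. This is a sketch only; it uses <code>PureWindowsPath</code> so that the Windows-style path is handled the same way on any operating system, and it only builds the new path without reading or writing a file.

```python
from pathlib import PureWindowsPath

data_path = PureWindowsPath(r'C:\Users\student\Desktop\Python Training\MOCK_DATA.csv')

# .stem is the file name without its extension; with_name swaps the final component
new_path = data_path.with_name(data_path.stem + '_modified.xls')

print(new_path)
```

On a real Windows machine <code>pathlib.Path</code> could be used in the same way, and pandas accepts such path objects wherever it accepts a path string.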

Output to an Excel file

In [163]:
google_df.to_excel(new_path, index = False)
print('Modified file outputted to %s' %new_path)

Modified file outputted to C:\Users\student\desktop\Python Training\MOCK_DATA_modified.xls


Load it back just to check

In [164]:
pd.read_excel(new_path)

Unnamed: 0,full_name,id,email,gender,salary,date_of_birth
0,Adrienne MacConnal,535,amacconnaleu@google.com.hk,Female,94610,18/09/1950
1,Alta Blackhurst,696,ablackhurstjb@google.com.au,Female,14482,06/10/1982
2,Ber Bunner,758,bbunnerl1@google.de,Male,58939,30/01/1993
3,Cam Scrowson,889,cscrowsonoo@google.it,Female,107027,03/05/1998
4,Cassandry Agge,913,caggepc@google.com.hk,Female,76660,10/05/1964
5,Constanta Antonomolii,853,cantonomoliino@google.com.br,Female,72382,20/10/1995
6,Douglass Putland,562,dputlandfl@google.es,Male,38509,28/02/1954
7,Elaine Dowbakin,383,edowbakinam@google.ca,Female,57259,29/06/1989
8,Eugenio Brazener,447,ebrazenerce@google.ca,Male,53693,17/04/1947
9,Gabriell Slinger,65,gslinger1s@google.com.hk,Female,59909,27/08/1953


## 13. Exercise - user defined function <a id = '13'></a>
[Back to contents](#cont)

Recall the code for calculating circle and sphere geometry from a radius:<br>
<code>PI = 3.14159265
d = 2\*r
c = 2\*PI\*r
A = PI\*r\*\*2
V = 4\*PI\*r\*\*3/3</code>

<b>Requirement</b>
1. Create a user defined function which returns <code>d</code>,<code>c</code>,<code>A</code>,<code>V</code> as a <code>tuple</code>.
2. BONUS - take user input for the radius by using <code>r = float(input())</code> (<code>input()</code> returns a string, so it needs converting to a number first)

You can use the <code>math</code> module to import PI as a constant rather than hard-coding it

HINT: your code should include <code>def</code>...

In [None]:
#type your code in here using the starting points below

from math import pi

#define function here
    # calculations here
#    return # return values here

# call the function here

## 14. Example - pandas: addressing, indexing, and masks<a id = '14'></a>
[Back to contents](#cont)

Let's create a DataFrame from scratch using lists

In [3]:
import pandas as pd

animal_list = ['dog','cat','rabbit','mouse','guinea pig','goldfish']
count_list = [7,5,3,1,1,5]

pd.DataFrame(list(zip(animal_list,count_list)), columns = ['animals', 'counts'])

Unnamed: 0,animals,counts
0,dog,7
1,cat,5
2,rabbit,3
3,mouse,1
4,guinea pig,1
5,goldfish,5


Equivalently using a dictionary rather than lists

In [4]:
data_dict = {'animals' : ['dog','cat','rabbit','mouse','guinea pig','goldfish'], 'counts' :[7,5,3,1,1,5]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,animals,counts
0,dog,7
1,cat,5
2,rabbit,3
3,mouse,1
4,guinea pig,1
5,goldfish,5


Note that the pandas index may not be numeric, and could be made of strings for example.

Using list comprehension (a compact <code>for</code> loop) we can create a new index list

In [5]:
new_index = ['row %i' %i for i in range(len(df))]
new_index

['row 0', 'row 1', 'row 2', 'row 3', 'row 4', 'row 5']
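For comparison, here is the same list built with an ordinary <code>for</code> loop. This is a sketch in which <code>n_rows</code> stands in for <code>len(df)</code> so that it runs on its own; the other names are illustrative.

```python
n_rows = 6  # stands in for len(df) in the cell above

# long form: start with an empty list and append inside a for loop
loop_index = []
for i in range(n_rows):
    loop_index.append('row %i' % i)

# list comprehension: the same loop written on one line
comprehension_index = ['row %i' % i for i in range(n_rows)]

print(loop_index == comprehension_index)
```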

Let's change the index

In [6]:
df.index = new_index
df

Unnamed: 0,animals,counts
row 0,dog,7
row 1,cat,5
row 2,rabbit,3
row 3,mouse,1
row 4,guinea pig,1
row 5,goldfish,5


Addressing columns as attributes produces a pandas Series

In [None]:
df.animals

Similarly using string column names to address columns

In [7]:
df['animals']

row 0           dog
row 1           cat
row 2        rabbit
row 3         mouse
row 4    guinea pig
row 5      goldfish
Name: animals, dtype: object

Recommended approach for pandas index and column indexing is <code>.loc[index,column]</code>

In [None]:
df.loc[:,'animals']

Also recommended using integer indexing <code>.iloc[row number, column number]</code>

In [None]:
df.iloc[:,0]

For specific rows and columns - we can slice as per usual

In [None]:
df.iloc[0:3, 0]

In [None]:
df.loc['row 2', 'animals']
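
One difference worth knowing when slicing: <code>.loc</code> slices by label and includes the end label, while <code>.iloc</code> slices by position and excludes the end position, like a normal Python slice. A sketch on a shortened copy of the DataFrame (rebuilt here so the example runs on its own; <code>by_label</code> and <code>by_position</code> are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'animals': ['dog', 'cat', 'rabbit', 'mouse'],
                   'counts': [7, 5, 3, 1]},
                  index=['row 0', 'row 1', 'row 2', 'row 3'])

# label-based slice: 'row 2' is included in the result
by_label = df.loc['row 0':'row 2', 'animals']

# position-based slice: position 2 is excluded from the result
by_position = df.iloc[0:2, 0]

print(by_label)
print(by_position)
```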

We can use logical tests to develop a 'Boolean mask' of a DataFrame

In [8]:
bool_mask = df['counts']>2
bool_mask

row 0     True
row 1     True
row 2     True
row 3    False
row 4    False
row 5     True
Name: counts, dtype: bool

We can then apply the mask to the DataFrame giving a reduced portion of the DataFrame which meets the logical criteria

In [9]:
df[bool_mask]

Unnamed: 0,animals,counts
row 0,dog,7
row 1,cat,5
row 2,rabbit,3
row 5,goldfish,5
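Masks can be combined with <code>&</code> (and), <code>|</code> (or) and <code>~</code> (not). Each test needs its own brackets because these operators bind more tightly than comparisons such as <code>></code>. A sketch on the same data, rebuilt here so it runs on its own; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'animals': ['dog', 'cat', 'rabbit', 'mouse', 'guinea pig', 'goldfish'],
                   'counts': [7, 5, 3, 1, 1, 5]})

# both conditions must hold
both = df[(df['counts'] > 2) & (df['animals'] != 'cat')]

# either condition may hold
either = df[(df['counts'] > 5) | (df['animals'] == 'mouse')]

# ~ inverts a mask
not_mask = df[~(df['counts'] > 2)]

print(both)
print(either)
print(not_mask)
```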


## 15. Example - pandas: iterating and applying<a id='15'></a>
[Back to contents](#cont)

Let's use the data from earlier to test some more techniques

In [6]:
import pandas as pd

# define file path
data_path = r'C:\Users\student\Desktop\Python Training\MOCK_DATA.csv'

# read in - data is now stored in a dataframe in python
my_df = pd.read_csv(data_path)

my_df


Iterate over columns to perform operations on each column (pandas Series)

In [None]:
for column in my_df:
    print(column, min(my_df[column]), max(my_df[column]))

This is equivalent to using the <code>items()</code> iterator (called <code>iteritems()</code> in older pandas versions, and removed in pandas 2.0), which yields a tuple of column name and values for each column. This is the recommended method for column iteration

In [None]:
for column, values in my_df.items():
    print(column, min(values), max(values))

The results for min and max date of birth don't look right. These dates have been wrongly imported as strings. Let's fix it

In [12]:
my_df['date_of_birth'] = pd.to_datetime(my_df['date_of_birth'])
my_df

Unnamed: 0,id,first_name,last_name,email,gender,ip_address,salary,date_of_birth
0,1,Graham,Ivanenkov,givanenkov0@uiuc.edu,Male,12.58.104.199,102778,1969-11-06
1,2,Karlis,Ballchin,kballchin1@alibaba.com,Male,128.63.193.159,70340,1969-10-04
2,3,Moria,Barber,mbarber2@oaic.gov.au,Female,175.165.231.76,28931,1950-05-03
3,4,Leighton,Quinnette,lquinnette3@biblegateway.com,Male,21.218.156.217,20684,1985-11-03
4,5,Trenton,Farmiloe,tfarmiloe4@webmd.com,Male,223.133.70.216,106946,1951-09-12
...,...,...,...,...,...,...,...,...
995,996,Innis,Lindell,ilindellrn@privacy.gov.au,Male,33.228.210.6,26023,1970-06-02
996,997,Edin,Gajewski,egajewskiro@privacy.gov.au,Female,7.180.8.157,18719,1971-09-26
997,998,Seymour,Boobier,sboobierrp@noaa.gov,Male,105.196.207.20,22734,1999-07-28
998,999,Gerick,Seyers,gseyersrq@unc.edu,Male,220.22.200.1,85659,1950-10-11
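A word of caution on ambiguous dates. Values such as 26/09/1971 in the source data suggest the column is written day first, in which case a value like 11/06/1969 is 11 June rather than 6 November. The sketch below assumes the column really is day/month/year and passes an explicit <code>format</code> so that every value is read the same way; the two sample values are taken from the table and <code>dob_text</code> is an illustrative name.

```python
import pandas as pd

dob_text = pd.Series(['11/06/1969', '26/09/1971'])

# an explicit format removes any guessing about day/month order
dob = pd.to_datetime(dob_text, format='%d/%m/%Y')

print(dob)
```

If the assumption is wrong and the file is month first, <code>'%m/%d/%Y'</code> would be the format to use instead, and a value such as 26/09/1971 would then raise an error rather than being silently misread.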


To perform operations on a row-by-row basis - we could iterate, but this is slow and inefficient in pandas.

E.g. let's calculate age based on current date

In [13]:
from datetime import datetime
age_list = []

#iterate over rows
for row in my_df['date_of_birth']:
    datediff_days = (datetime.now() - row).days
    age_list.append((datediff_days/365))

my_df['age'] = age_list
my_df

Unnamed: 0,id,first_name,last_name,email,gender,ip_address,salary,date_of_birth,age
0,1,Graham,Ivanenkov,givanenkov0@uiuc.edu,Male,12.58.104.199,102778,1969-11-06,50.882192
1,2,Karlis,Ballchin,kballchin1@alibaba.com,Male,128.63.193.159,70340,1969-10-04,50.972603
2,3,Moria,Barber,mbarber2@oaic.gov.au,Female,175.165.231.76,28931,1950-05-03,70.408219
3,4,Leighton,Quinnette,lquinnette3@biblegateway.com,Male,21.218.156.217,20684,1985-11-03,34.879452
4,5,Trenton,Farmiloe,tfarmiloe4@webmd.com,Male,223.133.70.216,106946,1951-09-12,69.046575
...,...,...,...,...,...,...,...,...,...
995,996,Innis,Lindell,ilindellrn@privacy.gov.au,Male,33.228.210.6,26023,1970-06-02,50.312329
996,997,Edin,Gajewski,egajewskiro@privacy.gov.au,Female,7.180.8.157,18719,1971-09-26,48.994521
997,998,Seymour,Boobier,sboobierrp@noaa.gov,Male,105.196.207.20,22734,1999-07-28,21.139726
998,999,Gerick,Seyers,gseyersrq@unc.edu,Male,220.22.200.1,85659,1950-10-11,69.967123


We can use <code>apply</code> to apply a function to a column without iterating.

Let's use the <code>numpy</code> library which includes a range of vectorized mathematical functions.

Specifically we can use <code>np.floor</code> to round down our ages then convert to <code>int</code>.

In [14]:
import numpy as np
# round down and convert
my_df['age'] = my_df['age'].apply(np.floor).astype(int)
my_df

Unnamed: 0,id,first_name,last_name,email,gender,ip_address,salary,date_of_birth,age
0,1,Graham,Ivanenkov,givanenkov0@uiuc.edu,Male,12.58.104.199,102778,1969-11-06,50
1,2,Karlis,Ballchin,kballchin1@alibaba.com,Male,128.63.193.159,70340,1969-10-04,50
2,3,Moria,Barber,mbarber2@oaic.gov.au,Female,175.165.231.76,28931,1950-05-03,70
3,4,Leighton,Quinnette,lquinnette3@biblegateway.com,Male,21.218.156.217,20684,1985-11-03,34
4,5,Trenton,Farmiloe,tfarmiloe4@webmd.com,Male,223.133.70.216,106946,1951-09-12,69
...,...,...,...,...,...,...,...,...,...
995,996,Innis,Lindell,ilindellrn@privacy.gov.au,Male,33.228.210.6,26023,1970-06-02,50
996,997,Edin,Gajewski,egajewskiro@privacy.gov.au,Female,7.180.8.157,18719,1971-09-26,48
997,998,Seymour,Boobier,sboobierrp@noaa.gov,Male,105.196.207.20,22734,1999-07-28,21
998,999,Gerick,Seyers,gseyersrq@unc.edu,Male,220.22.200.1,85659,1950-10-11,69
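Because <code>np.floor</code> is vectorized, it can also be called on the whole Series directly, without <code>apply</code>. A sketch using three of the ages from the earlier table (<code>ages</code> and <code>whole_years</code> are illustrative names):

```python
import numpy as np
import pandas as pd

ages = pd.Series([50.882192, 70.408219, 34.879452])

# np.floor works on every element at once and returns a Series
whole_years = np.floor(ages).astype(int)

print(whole_years)
```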


We can repeat this whole calculation process more compactly, using <code>apply</code> with a disposable <code>lambda</code> function

In [15]:
my_df['age'] = my_df.apply(lambda row: int(np.floor((datetime.now() - row['date_of_birth']).days/365)) , axis =1)
my_df

Unnamed: 0,id,first_name,last_name,email,gender,ip_address,salary,date_of_birth,age
0,1,Graham,Ivanenkov,givanenkov0@uiuc.edu,Male,12.58.104.199,102778,1969-11-06,50
1,2,Karlis,Ballchin,kballchin1@alibaba.com,Male,128.63.193.159,70340,1969-10-04,50
2,3,Moria,Barber,mbarber2@oaic.gov.au,Female,175.165.231.76,28931,1950-05-03,70
3,4,Leighton,Quinnette,lquinnette3@biblegateway.com,Male,21.218.156.217,20684,1985-11-03,34
4,5,Trenton,Farmiloe,tfarmiloe4@webmd.com,Male,223.133.70.216,106946,1951-09-12,69
...,...,...,...,...,...,...,...,...,...
995,996,Innis,Lindell,ilindellrn@privacy.gov.au,Male,33.228.210.6,26023,1970-06-02,50
996,997,Edin,Gajewski,egajewskiro@privacy.gov.au,Female,7.180.8.157,18719,1971-09-26,48
997,998,Seymour,Boobier,sboobierrp@noaa.gov,Male,105.196.207.20,22734,1999-07-28,21
998,999,Gerick,Seyers,gseyersrq@unc.edu,Male,220.22.200.1,85659,1950-10-11,69


This is better - but the syntax is a little messy, and we haven't accounted for leap-years. We really need to define a custom function for this and <code>apply</code> that instead.

Define and test the function:

In [17]:
def calc_age(dob: datetime) -> int:
    # age in whole years as of today
    current = datetime.now()
    # difference in calendar years
    year_delta = current.year - dob.year

    # if this year's birthday has not happened yet, subtract one year
    if current.month > dob.month or (current.month == dob.month and current.day >= dob.day):
        age = year_delta
    else:
        age = year_delta - 1
    return age
    
    
print(my_df.loc[0,'date_of_birth'])
print(calc_age(my_df.loc[0,'date_of_birth']))
    

1969-11-06 00:00:00
50


Now let's <code>apply</code> the <code>calc_age</code> function to the <code>'date_of_birth'</code> <code>Series</code>:

In [18]:
my_df['age'] = my_df['date_of_birth'].apply(calc_age)
my_df

Unnamed: 0,id,first_name,last_name,email,gender,ip_address,salary,date_of_birth,age
0,1,Graham,Ivanenkov,givanenkov0@uiuc.edu,Male,12.58.104.199,102778,1969-11-06,50
1,2,Karlis,Ballchin,kballchin1@alibaba.com,Male,128.63.193.159,70340,1969-10-04,50
2,3,Moria,Barber,mbarber2@oaic.gov.au,Female,175.165.231.76,28931,1950-05-03,70
3,4,Leighton,Quinnette,lquinnette3@biblegateway.com,Male,21.218.156.217,20684,1985-11-03,34
4,5,Trenton,Farmiloe,tfarmiloe4@webmd.com,Male,223.133.70.216,106946,1951-09-12,68
...,...,...,...,...,...,...,...,...,...
995,996,Innis,Lindell,ilindellrn@privacy.gov.au,Male,33.228.210.6,26023,1970-06-02,50
996,997,Edin,Gajewski,egajewskiro@privacy.gov.au,Female,7.180.8.157,18719,1971-09-26,48
997,998,Seymour,Boobier,sboobierrp@noaa.gov,Male,105.196.207.20,22734,1999-07-28,21
998,999,Gerick,Seyers,gseyersrq@unc.edu,Male,220.22.200.1,85659,1950-10-11,69


Success!
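
To see why the calendar comparison matters, here is a small sketch comparing it with the days/365 approach on one hand-picked date. The helper <code>age_on</code> is a hypothetical variant of <code>calc_age</code> that takes the reference date as an argument, purely so the result does not depend on when the code is run:

```python
from datetime import datetime

def age_on(dob: datetime, ref: datetime) -> int:
    """Age in whole years on the reference date (hypothetical helper)."""
    # has the birthday already happened in the reference year?
    had_birthday = (ref.month, ref.day) >= (dob.month, dob.day)
    return ref.year - dob.year - (0 if had_birthday else 1)

dob = datetime(2000, 3, 1)
ref = datetime(2020, 2, 29)   # the day before the 20th birthday

# days/365 approach: leap days accumulate and push the result up too early
naive_age = (ref - dob).days // 365
# calendar comparison, same logic as calc_age
exact_age = age_on(dob, ref)

print(naive_age, exact_age)
```

Here the days/365 approach reports 20 one day too early, while the calendar comparison gives 19.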

## 16. Example - pandas: grouping and aggregating<a id = '16'></a>
[Back to contents](#cont)

Using the <code>.groupby</code> method we can group similar entries together. Grouping by gender:

In [20]:
my_df.groupby('gender')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001E5DCD8CE88>
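
The output above is just the grouped object itself: nothing is calculated until we ask for it. A minimal sketch of two ways to look inside a grouped object, using a small made-up DataFrame (<code>demo_df</code> and its values are illustrative assumptions):

```python
import pandas as pd

# illustrative data only
demo_df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male', 'Female'],
    'salary': [30000, 42000, 38000, 51000, 45000],
})

grouped = demo_df.groupby('gender')

# number of rows in each group
group_sizes = grouped.size()
print(group_sizes)

# iterating yields (group key, sub-DataFrame) pairs
for key, sub_df in grouped:
    print(key, sub_df['salary'].tolist())
```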

Now let's add some custom grouping using a <code>mask</code> and the <code>numpy</code> <code>where</code> function

In [22]:
# boolean mask: True where age is under 50
mask = my_df['age']<50
# convert the boolean mask into an array of group labels
mask = np.where(mask,'Less than 50', '50 or older')

grouped_df = my_df.groupby(by=['gender',mask])
grouped_df

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001E5DCD7E808>

Now we can aggregate the data by functions of our choosing. Let's check the columns first

In [None]:
my_df.columns

Now let's aggregate

In [23]:
# use a dict for ease
agg_dict = {'id': 'count', 'age': 'mean', 'salary': ['min', 'max' ,'std']}
agg_df = grouped_df.agg(agg_dict)
agg_df

Unnamed: 0_level_0,Unnamed: 1_level_0,id,age,salary,salary,salary
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,min,max,std
gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Female,50 or older,203,62.374384,10480,109916,29853.819621
Female,Less than 50,275,33.883636,10509,109983,29436.61377
Male,50 or older,236,61.627119,10548,109222,28856.564129
Male,Less than 50,286,33.667832,10009,109964,29027.91756


This looks good - but has produced <code>MultiIndex</code> columns. These can be useful but may be hard to work with when outputting data. Let's flatten them to new column names.

In [24]:
# use list comprehension to get new column names
# (iterating over MultiIndex columns yields (column, function) tuples)
new_cols = [col+'_'+function for col, function in agg_df.columns]

# assign new column names
agg_df.columns = new_cols
agg_df

Unnamed: 0_level_0,Unnamed: 1_level_0,id_count,age_mean,salary_min,salary_max,salary_std
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Female,50 or older,203,62.374384,10480,109916,29853.819621
Female,Less than 50,275,33.883636,10509,109983,29436.61377
Male,50 or older,236,61.627119,10548,109222,28856.564129
Male,Less than 50,286,33.667832,10009,109964,29027.91756
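
As an alternative to flattening afterwards, pandas also supports named aggregation, where each output column is named up front as <code>new_name=(column, function)</code>. A minimal sketch on made-up data (<code>demo_df</code>, <code>age_band</code> and <code>flat_agg_df</code> are illustrative names, not part of the workbook):

```python
import numpy as np
import pandas as pd

# illustrative data only
demo_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'age': [34, 61, 45, 52],
    'salary': [30000, 42000, 38000, 51000],
})

# same style of custom grouping as above, built from a boolean mask
age_band = np.where(demo_df['age'] < 50, 'Less than 50', '50 or older')

# each keyword becomes a flat output column name
flat_agg_df = demo_df.groupby(['gender', age_band]).agg(
    id_count=('id', 'count'),
    age_mean=('age', 'mean'),
    salary_min=('salary', 'min'),
    salary_max=('salary', 'max'),
)
print(flat_agg_df.columns.tolist())
```

The dict approach used above is handy when the aggregations are built programmatically; named aggregation avoids the <code>MultiIndex</code> columns in the first place.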


## 17. Example - pandas: concatenating DataFrames<a id = '17'></a> 
[Back to contents](#cont)

Let's go back to the original data and split the <code>DataFrame</code> into 200-row chunks, storing each chunk as a new <code>DataFrame</code> in a <code>list</code>:

In [27]:
import numpy as np

# create empty list to store dfs:
df_list = []

# use a numpy array to find lower bounds
n = 200
lower_list = np.arange(0,len(my_df),n)

# loop to split into separate dataframes
for low in lower_list:
    split_df = my_df.iloc[low:low+n,:]
    df_list.append(split_df)
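
The same split can also be written as a list comprehension. A minimal sketch on a made-up 10-row DataFrame with a chunk size of 4 (<code>demo_df</code>, <code>chunks</code> and <code>rebuilt_df</code> are illustrative names), which also checks that the chunks cover every row:

```python
import pandas as pd

# illustrative data only: 10 rows with ids 1..10
demo_df = pd.DataFrame({'id': range(1, 11)})
n = 4   # chunk size

# one slice of up to n rows per lower bound
chunks = [demo_df.iloc[low:low + n] for low in range(0, len(demo_df), n)]

# the last chunk is shorter when the row count is not a multiple of n
print([len(chunk) for chunk in chunks])

# sticking all chunks back together recovers the original rows
rebuilt_df = pd.concat(chunks)
```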

Now let's arbitrarily stick the 3rd and 4th <code>DataFrames</code> back together - like a SQL <code>UNION</code>

In [28]:
union_df = pd.concat(df_list[2:4])
union_df

Unnamed: 0,id,first_name,last_name,email,gender,ip_address,salary,date_of_birth,age
400,401,Tyrone,Buntine,tbuntineb4@bluehost.com,Male,221.211.251.33,74736,1983-03-29,37
401,402,Brady,O'dell,bodellb5@timesonline.co.uk,Male,167.149.10.166,99656,1948-12-01,71
402,403,Emanuel,Vaughan,evaughanb6@typepad.com,Male,135.94.208.201,59445,1951-06-04,69
403,404,Adelle,Batteson,abattesonb7@wired.com,Female,156.148.98.84,52885,2000-12-21,19
404,405,Vinny,Rennicks,vrennicksb8@utexas.edu,Female,38.130.45.163,69290,1945-11-16,74
...,...,...,...,...,...,...,...,...,...
795,796,Hewe,Lacelett,hlacelettm3@omniture.com,Male,202.164.62.230,23244,1954-11-01,65
796,797,Ricky,Dood,rdoodm4@friendfeed.com,Male,215.221.31.159,14971,1987-09-28,32
797,798,Waverley,Scarlon,wscarlonm5@scientificamerican.com,Male,208.156.172.222,54296,1965-05-17,55
798,799,Rozamond,Powles,rpowlesm6@blogs.com,Female,89.237.127.23,68632,1957-12-10,62


Next let's <code>append</code> a new row to the bottom of this new <code>DataFrame</code>. First create a <code>dict</code> of column-value pairs:

In [29]:
cols = union_df.columns.to_list()
values = (1001,'Joe','Bloggs','joebloggs99@hotmail.com','Male','345.112.543.22',13470,'17/05/1999')

newrow_dict = dict(zip(cols,values))
newrow_dict

{'id': 1001,
 'first_name': 'Joe',
 'last_name': 'Bloggs',
 'email': 'joebloggs99@hotmail.com',
 'gender': 'Male',
 'ip_address': '345.112.543.22',
 'salary': 13470,
 'date_of_birth': '17/05/1999'}

Now add to the <code>DataFrame</code>. Note that the new entry has <code>NaN</code> for age - we calculated that column earlier, so <code>newrow_dict</code> contains no value for it.

In [30]:
union_df.append(newrow_dict, ignore_index = True)

Unnamed: 0,id,first_name,last_name,email,gender,ip_address,salary,date_of_birth,age
0,401,Tyrone,Buntine,tbuntineb4@bluehost.com,Male,221.211.251.33,74736,1983-03-29 00:00:00,37.0
1,402,Brady,O'dell,bodellb5@timesonline.co.uk,Male,167.149.10.166,99656,1948-12-01 00:00:00,71.0
2,403,Emanuel,Vaughan,evaughanb6@typepad.com,Male,135.94.208.201,59445,1951-06-04 00:00:00,69.0
3,404,Adelle,Batteson,abattesonb7@wired.com,Female,156.148.98.84,52885,2000-12-21 00:00:00,19.0
4,405,Vinny,Rennicks,vrennicksb8@utexas.edu,Female,38.130.45.163,69290,1945-11-16 00:00:00,74.0
...,...,...,...,...,...,...,...,...,...
396,797,Ricky,Dood,rdoodm4@friendfeed.com,Male,215.221.31.159,14971,1987-09-28 00:00:00,32.0
397,798,Waverley,Scarlon,wscarlonm5@scientificamerican.com,Male,208.156.172.222,54296,1965-05-17 00:00:00,55.0
398,799,Rozamond,Powles,rpowlesm6@blogs.com,Female,89.237.127.23,68632,1957-12-10 00:00:00,62.0
399,800,Poul,Burbury,pburburym7@weather.com,Male,163.247.105.128,45017,1977-12-20 00:00:00,42.0
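
Note that <code>DataFrame.append</code> was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older versions. A minimal sketch of one equivalent using <code>pd.concat</code>, on a cut-down made-up DataFrame (<code>demo_df</code>, <code>newrow_df</code> and <code>appended_df</code> are illustrative names; the values are borrowed from the tables above for illustration):

```python
import pandas as pd

# illustrative data only - a cut-down version of the table above
demo_df = pd.DataFrame({
    'id': [401, 402],
    'first_name': ['Tyrone', 'Brady'],
    'age': [37, 71],
})

# new row with no value for 'age'
newrow_dict = {'id': 1001, 'first_name': 'Joe'}

# wrap the dict in a list to build a one-row DataFrame, then concatenate
newrow_df = pd.DataFrame([newrow_dict])
appended_df = pd.concat([demo_df, newrow_df], ignore_index=True)
print(appended_df)
```

As with <code>append</code>, the column missing from the new row is filled with <code>NaN</code>, and <code>ignore_index = True</code> renumbers the rows from 0.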


We can also join the DataFrames horizontally (side by side) using the <code>axis = 1</code> argument. Notice that empty <code>NaN</code> values are created because the indexes don't match - this is like a SQL <code>OUTER JOIN</code>.

In [31]:
pd.concat(df_list[2:4], axis = 1)

Unnamed: 0,id,first_name,last_name,email,gender,ip_address,salary,date_of_birth,age,id.1,first_name.1,last_name.1,email.1,gender.1,ip_address.1,salary.1,date_of_birth.1,age.1
400,401.0,Tyrone,Buntine,tbuntineb4@bluehost.com,Male,221.211.251.33,74736.0,1983-03-29,37.0,,,,,,,,NaT,
401,402.0,Brady,O'dell,bodellb5@timesonline.co.uk,Male,167.149.10.166,99656.0,1948-12-01,71.0,,,,,,,,NaT,
402,403.0,Emanuel,Vaughan,evaughanb6@typepad.com,Male,135.94.208.201,59445.0,1951-06-04,69.0,,,,,,,,NaT,
403,404.0,Adelle,Batteson,abattesonb7@wired.com,Female,156.148.98.84,52885.0,2000-12-21,19.0,,,,,,,,NaT,
404,405.0,Vinny,Rennicks,vrennicksb8@utexas.edu,Female,38.130.45.163,69290.0,1945-11-16,74.0,,,,,,,,NaT,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,,,,,,,,NaT,,796.0,Hewe,Lacelett,hlacelettm3@omniture.com,Male,202.164.62.230,23244.0,1954-11-01,65.0
796,,,,,,,,NaT,,797.0,Ricky,Dood,rdoodm4@friendfeed.com,Male,215.221.31.159,14971.0,1987-09-28,32.0
797,,,,,,,,NaT,,798.0,Waverley,Scarlon,wscarlonm5@scientificamerican.com,Male,208.156.172.222,54296.0,1965-05-17,55.0
798,,,,,,,,NaT,,799.0,Rozamond,Powles,rpowlesm6@blogs.com,Female,89.237.127.23,68632.0,1957-12-10,62.0


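If we want to ignore the indexes and simply stick the columns together side by side, <b>don't</b> use the <code>ignore_index</code> keyword: with <code>axis = 1</code> it relabels the columns rather than the rows. Instead we need to <code>reset_index</code> on each individual DataFrame before we join them (the old index is kept as a new <code>index</code> column):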

In [32]:
new_index_df_list = [df.reset_index() for df in df_list]

pd.concat(new_index_df_list[2:4], axis = 1)

Unnamed: 0,index,id,first_name,last_name,email,gender,ip_address,salary,date_of_birth,age,index.1,id.1,first_name.1,last_name.1,email.1,gender.1,ip_address.1,salary.1,date_of_birth.1,age.1
0,400,401,Tyrone,Buntine,tbuntineb4@bluehost.com,Male,221.211.251.33,74736,1983-03-29,37,600,601,Celle,Dering,cderinggo@vinaora.com,Female,67.43.105.98,75053,1954-01-18,66
1,401,402,Brady,O'dell,bodellb5@timesonline.co.uk,Male,167.149.10.166,99656,1948-12-01,71,601,602,Pedro,Deware,pdewaregp@noaa.gov,Male,217.89.241.78,19086,1953-01-18,67
2,402,403,Emanuel,Vaughan,evaughanb6@typepad.com,Male,135.94.208.201,59445,1951-06-04,69,602,603,Sandy,Varey,svareygq@list-manage.com,Female,194.102.249.26,27870,2000-12-07,19
3,403,404,Adelle,Batteson,abattesonb7@wired.com,Female,156.148.98.84,52885,2000-12-21,19,603,604,Berti,Virr,bvirrgr@godaddy.com,Male,175.233.39.240,48203,1964-03-22,56
4,404,405,Vinny,Rennicks,vrennicksb8@utexas.edu,Female,38.130.45.163,69290,1945-11-16,74,604,605,Caddric,Arpur,carpurgs@1und1.de,Male,86.40.160.203,50772,1971-12-31,48
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,595,596,Kelwin,Condon,kcondongj@ezinearticles.com,Male,184.98.107.154,82646,2002-09-23,17,795,796,Hewe,Lacelett,hlacelettm3@omniture.com,Male,202.164.62.230,23244,1954-11-01,65
196,596,597,Berkley,Hully,bhullygk@disqus.com,Male,169.59.29.111,28258,1996-09-28,23,796,797,Ricky,Dood,rdoodm4@friendfeed.com,Male,215.221.31.159,14971,1987-09-28,32
197,597,598,Greggory,Duerden,gduerdengl@cargocollective.com,Male,194.212.68.214,76442,1998-10-25,21,797,798,Waverley,Scarlon,wscarlonm5@scientificamerican.com,Male,208.156.172.222,54296,1965-05-17,55
198,598,599,Benetta,Haughin,bhaughingm@amazon.co.jp,Female,168.205.4.251,31512,1983-04-30,37,798,799,Rozamond,Powles,rpowlesm6@blogs.com,Female,89.237.127.23,68632,1957-12-10,62
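
To make the difference concrete, here is a minimal sketch on two tiny made-up DataFrames (<code>left_df</code>, <code>right_df</code> and the result names are illustrative assumptions). It also uses <code>reset_index(drop = True)</code>, which discards the old index instead of keeping it as an extra column:

```python
import pandas as pd

# illustrative data only: two frames with non-overlapping indexes
left_df = pd.DataFrame({'id': [401, 402]}, index=[400, 401])
right_df = pd.DataFrame({'id': [601, 602]}, index=[600, 601])

# ignore_index=True with axis=1 relabels the COLUMNS 0, 1, ...
# the rows are still aligned on the original indexes, so NaN gaps remain
relabelled_df = pd.concat([left_df, right_df], axis=1, ignore_index=True)
print(relabelled_df.shape)

# resetting each index first lines the rows up by position;
# drop=True throws the old index away rather than adding an 'index' column
side_by_side_df = pd.concat(
    [df.reset_index(drop=True) for df in (left_df, right_df)], axis=1
)
print(side_by_side_df)
```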


## 18. Exercise - collecting and analysing separate data sources<a id = '18'></a>
[Back to contents](#cont)

<b>Requirement</b>
1. Read in the 6 data files containing car sales data, which sit somewhere (exact location unknown) within the folder <code>C:\Users\student\Python Training\Car Sales</code>


2. Store the region number from the filename


3. Combine the individual files into one <code>DataFrame</code>, the new <code>DataFrame</code> should include which region the file is from


4. Fix any columns which may not be formatted as the right data type (<b>hint:</b> look at numerics and dates... you may want to look up the <code>Series.str.replace()</code>, <code>pd.to_datetime()</code> and <code>pd.to_numeric()</code> functions)


5. Write down (reduce) the rrp and monthly_cost by 30% each for any Audi brand car which has a model year earlier than 2005 (<b>hint</b>: there are multiple ways to achieve this, but <code>np.where</code> could be useful here. To combine logical tests you will need to use the bitwise operator <code>&</code>, which works element by element, not the scalar operator <code>and</code>)


6. Get rid of any entries which have no data in the numeric fields - these are erroneous


7. Use the provided dictionary <code>region_dict</code> to (<b>hint</b>) <code>map</code> regions to the region number deduced from the filenames


8. Summarize data by region and make - some stats like mean rrp would be useful


9. Output the summary to a new csv or excel file


10. BONUS - find the rarest cars


These requirements are complicated so don't expect to get them all. Facilitators will drop in and out of groups to help as much as possible. We can continue the exercise into Day 2 if we run out of time.

Feel free to use Spyder or another IDE rather than Jupyter if you find it easier. Materials can also be found on the remote desktop instead.
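
As a warm-up for the functions hinted at in requirement 4, here is a minimal sketch on made-up values. It is not the exercise solution: the DataFrame <code>demo_df</code>, the <code>sale_date</code> column name and all values are illustrative assumptions, and the real files may be formatted differently.

```python
import pandas as pd

# illustrative data only - the real files may look different
demo_df = pd.DataFrame({
    'rrp': ['£12,500', '£8,999', 'unknown'],
    'sale_date': ['2019-03-14', '2018-11-02', 'not recorded'],
})

# strip the currency symbol and the thousands separator from the text
rrp_text = demo_df['rrp'].str.replace('£', '', regex=False)
rrp_text = rrp_text.str.replace(',', '', regex=False)

# errors='coerce' turns anything that cannot be converted into NaN
demo_df['rrp'] = pd.to_numeric(rrp_text, errors='coerce')

# the same idea for dates: unparseable values become NaT
demo_df['sale_date'] = pd.to_datetime(demo_df['sale_date'], errors='coerce')

print(demo_df.dtypes)
```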



Let's build up some pseudo code as a hint:

In [7]:
# pseudocode for one way of doing it:

# import any modules needed

# loop over all directories and files using os.walk - creating lists for each
    # loop over all files in the list
        # read each individual file into a pandas dataframe
        # store the dataframe in a dictionary with the filename as the key

# iterate over the dictionary key, value pairs
    # take the file name and use string manipulation to split it up and deduce region number
    # create new column in the dataframe for the region number
    # convert number to int from string
    
# combine separate dataframes into one
# remove extra strings around the numeric fields (e.g. £)
# convert the numeric fields to numerics / dates where required

# find elements of the dataframe where the car make is audi and year is pre 2005
# for all of these elements reduce them by the percentage given in the requirement
# find rows where the numeric fields are NaN
# drop these rows

# deduce region names from region number and map - and put these in dataframe

# group dataframe on region and make
# aggregate using functions of your choice

# do any rounding or changes to column names required

# define an output path based on source path
# output the summary to csv 

# group non aggregated dataframe by car make and model
# aggregate by count
# sort lowest count to highest
# find the lowest count of cars
# find all the items which have this count
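
Two of the pseudocode steps above, combining logical tests with <code>&</code> inside <code>np.where</code> and mapping numbers to names, can be sketched on made-up data. This is not the exercise solution: <code>demo_df</code>, its columns, the 50% figure and <code>demo_region_dict</code> are all arbitrary, illustrative assumptions.

```python
import numpy as np
import pandas as pd

# illustrative data only
demo_df = pd.DataFrame({
    'make': ['Ford', 'Ford', 'Fiat'],
    'year': [1998, 2010, 1995],
    'price': [1000.0, 5000.0, 2000.0],
    'region_number': [1, 2, 1],
})

# wrap each test in parentheses: & binds more tightly than == and <
is_old_ford = (demo_df['make'] == 'Ford') & (demo_df['year'] < 2000)

# np.where(condition, value_if_true, value_if_false)
demo_df['price'] = np.where(is_old_ford, demo_df['price'] * 0.5, demo_df['price'])

# map looks up each value in the dictionary
demo_region_dict = {1: 'North', 2: 'South-East'}
demo_df['region'] = demo_df['region_number'].map(demo_region_dict)

print(demo_df)
```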

In [165]:
# Your code goes here - here are some starting points

import os
import pandas as pd
import numpy as np

# mapping of numbers to regions
region_dict = dict(zip([1,2,3,4,5,6],['North','South-East','South-West','London','Midlands','Other']))

# directory path to read from
source_path = r'C:\Users\student\Desktop\Python Training\Car Sales'

# initialise a dictionary to store the files
read_dict = {}

# loop over directories, subdirectories, files as a starting point
# os.walk yields (current directory path, subdirectory names, file names)
for dirpath, subdirs, files in os.walk(source_path):
    # loop over list of files
    for file in files:
        # read in the files!
        print('The rest is up to you!')
        

Enter your code here
Enter your code here
Enter your code here
Enter your code here
Enter your code here
Enter your code here
