<a href="https://colab.research.google.com/github/njaincode/python_for_data_science/blob/main/Interrogating_dataframes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Interrogating the dataframe
---

We use pandas to create dataframes from data sets in files or online tables.

Once we have the dataframe we might want to:  
*   display a subset of rows
    *   we can do this using `df.head()`, `df.tail()` or `df.iloc[index]`
*   display a subset of columns
    *   we can do this using the column name as the key, e.g. `df['rain (mm)']`
    *   we can display more than one column by listing the column names inside a second pair of square brackets, e.g. `df[['rain (mm)', 'tmin (degC)', 'sun (hours)']]`
    *   we can display a subset of rows for a subset of columns, e.g. `df[['rain (mm)', 'tmax (degC)']].head(10)`
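The selections listed above can be tried on a small dataframe before working with the real data. This is a minimal sketch: the weather values are invented for illustration, and only the column names are taken from the examples above.

```python
import pandas as pd

# Hypothetical weather data, invented for illustration only
df = pd.DataFrame({
    'rain (mm)': [10.2, 35.1, 0.0, 12.7, 48.3],
    'tmin (degC)': [1.5, 2.0, 4.1, 6.3, 9.8],
    'tmax (degC)': [7.9, 8.4, 11.0, 14.2, 17.5],
    'sun (hours)': [45.0, 60.5, 98.2, 130.1, 170.4],
})

first_two = df.head(2)      # first 2 rows, all columns
last_two = df.tail(2)       # last 2 rows, all columns
third_row = df.iloc[2]      # the single row at position 2, returned as a Series
rain = df['rain (mm)']      # one column, returned as a Series

# A list of column names selects several columns; head(3) then limits the rows
two_cols = df[['rain (mm)', 'tmax (degC)']].head(3)
print(two_cols)
```

Note that a single column name gives back a Series, while a list of names (the double square brackets) gives back a dataframe.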




### Exercise 1 - get the data, look at first 10 records
---


1.  Import the pandas library with the alias pd  
2.  Read the sheet with the name "Industry Migration" from the Excel file from this link: https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true  (*Hint:  use pd.read_excel(url, sheet_name="Industry Migration") with the above url.*)
3.  Display the first 10 rows of the data



In [None]:
import pandas as pd

def display_10_records():
  excel_url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
  df = pd.read_excel(excel_url, sheet_name='Industry Migration')
  
  total_rows = len(df.axes[0])
  total_col = len(df.axes[1])
  print(f'rows={total_rows}, col={total_col}')

  # Setup some display options
  pd.options.display.max_columns= total_col
  pd.options.display.max_rows= total_rows
  
  # Just to see
  print(df.shape)
  df.info()  # info() prints its summary itself and returns None, so no print() is needed
  
  # FIXME Nice way to print?
  print(df.head(10))

display_10_records()
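On the question of a nicer way to print: one possible approach, sketched below, is `pd.option_context`, which applies display options only inside a `with` block, or `DataFrame.to_string()`, which renders every selected row and column regardless of the display options. The dataframe here is a hypothetical stand-in for the "Industry Migration" sheet, so the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the Industry Migration sheet
df = pd.DataFrame({
    'country_name': [f'Country {i}' for i in range(30)],
    'industry_name': ['Example industry'] * 30,
    'net_per_10K_2019': [i * 0.5 for i in range(30)],
})

rows_before = pd.get_option('display.max_rows')

# Options set here apply only inside the with block and are restored afterwards
with pd.option_context('display.max_columns', None, 'display.max_rows', 10):
    print(df.head(10))

rows_after = pd.get_option('display.max_rows')

# to_string() builds the full text of the selection, ignoring display limits
text = df.head(10).to_string()
print(text)
```

In a notebook, leaving `df.head(10)` as the last expression of a cell (rather than printing it inside a function) also renders it as a formatted table.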

### Exercise 2 - display just two columns

Display the first 50 records focusing on the industries where migration is happening in each country (show just country_name and industry_name).

In [None]:
import pandas as pd

def display_2col_records():
  excel_url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
  df = pd.read_excel(excel_url, sheet_name='Industry Migration')
  
  total_rows = len(df.axes[0])
  total_col = len(df.axes[1])
  print(f'rows={total_rows}, col={total_col}')
  pd.options.display.max_columns= total_col
  pd.options.display.max_rows= 50
  
  # For debug
  #print(df.info())

  print(f'Printing country and industry names\n')
  print(df[['country_name' , 'industry_name']].head(50))
  
display_2col_records()

### Exercise 3 - yearly migration

*   display the middle 20 records, showing the country_name and the migration for each of the years 2015 to 2019.
---


In [None]:
import pandas as pd

def display_mid20_records():
  excel_url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
  df = pd.read_excel(excel_url, sheet_name='Industry Migration')
  
  total_rows = len(df.axes[0])
  total_col = len(df.axes[1])
  
  # Set some display options when run
  pd.options.display.max_columns= total_col
  pd.options.display.max_rows= 20

  # Calculate ball park mid 20 indexes assuming there are >=20 rows
  mid = int(total_rows/2)
  idx_low = mid-10
  idx_high = mid+9
  print(f'rows={total_rows}, col={total_col}, mid_range {idx_low}: {idx_high}')
  
  # For debug
  #print(df.info())
  #print(df.columns)

  # .loc accepts a row-label slice and a list of column names together.
  # With the default integer index the slice includes both idx_low and idx_high,
  # so this selects 20 rows.
  subset = df.loc[idx_low:idx_high, ['country_name', 'net_per_10K_2015', 'net_per_10K_2016','net_per_10K_2017','net_per_10K_2018','net_per_10K_2019']]
  print(subset.shape)
  print(f'Displaying mid 20 records with country name and migration from years [2015:2019]\n')
  print(subset)
  

display_mid20_records()
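The two open questions in the cell above (combining a row range with column names, and selecting by column number) come down to the difference between `.loc` and `.iloc`. The sketch below uses a small invented dataframe with the default integer index, and only two of the yearly columns, to show both forms selecting the same data.

```python
import pandas as pd

# Hypothetical data with a default 0..9 integer index
df = pd.DataFrame({
    'country_name': [f'Country {i}' for i in range(10)],
    'net_per_10K_2015': [float(i) for i in range(10)],
    'net_per_10K_2016': [float(i) * 2 for i in range(10)],
})

# .loc is label-based: the row slice includes BOTH endpoints (3, 4, 5, 6)
by_label = df.loc[3:6, ['country_name', 'net_per_10K_2015']]

# .iloc is position-based: the end of the slice is excluded, so 3:7 is needed
col_idx = df.columns.get_loc('net_per_10K_2015')
by_position = df.iloc[3:7, [0, col_idx]]

print(by_label)
print(by_label.equals(by_position))
```

Because `.loc` includes the end label and `.iloc` excludes the end position, a 20-row window needs `idx_low:idx_low + 19` with `.loc` but `idx_low:idx_low + 20` with `.iloc`.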

### Exercise 4 - migration to and from
---

The Excel file at this link (which you have already opened above): https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true has three data sheets, "Country Migration", "Industry Migration" and "Skills Migration"

Open the data sheet "Country Migration" and display the first 10 records showing the country that migration was to, the country it was from and the net migration for each year from 2015 to 2019

In [None]:
import pandas as pd

def display_record(df, dest, src, net1):
  subset = df.loc[0:9, [dest, src, net1]]
  print(subset.shape)
  print(f'Displaying the first 10 records with the destination, source and net migration columns\n')
  print(subset)

def display_country_migration():
  excel_url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
  df = pd.read_excel(excel_url, sheet_name='Country Migration')
  total_rows = len(df.axes[0])
  total_col = len(df.axes[1])
  
  # For debug
  #print(f'rows={total_rows}, col={total_col}')
  
  # Set some display options when run
  pd.options.display.max_rows= 10
  
  # Migrated to, migrated from and net migration for years [2015:2019]
  cols = ['target_country_name', 'base_country_name', 'net_per_10K_2015', 'net_per_10K_2016', 'net_per_10K_2017', 'net_per_10K_2018', 'net_per_10K_2019']
  df1 = df[cols]
  print(df1.head(10))
  #display_record(df, 'target_country_name', 'base_country_name', 'net_per_10K_2015')

   
display_country_migration()

### Exercise 5 - unique countries

Using the "Country Migration" sheet, get all the unique country names from where people have migrated (base_country_name) and use a for loop to print a list of the countries.

In [None]:
import pandas as pd

def display_unique_countries():
  excel_url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
  df = pd.read_excel(excel_url, sheet_name='Country Migration')
    
  # Set some display options when run
  pd.options.display.max_rows= 10
  
  # Just for learning
  #bah = df[df.base_country_name == 'Bahrain']
  #print(bah.head(10))
  #print(df.keys())
  
  # Find unique base_country_name
  # Get all rows from base_country_name key
  df_base_name = df['base_country_name']
  print(f'Total number of entries in base_country_name = {df_base_name.shape}')
  # For debug
  #print(df_base_name.head(10))
  
  # Another way
  #df_uniq = df.drop_duplicates(subset = ['base_country_name'])
  
  # Get all unique values in rows
  df_base_name_unique = df_base_name.drop_duplicates()
  print(f'Total number of unique entries in base_country_name = {df_base_name_unique.shape}')
    
  print(f'Displaying unique country names using for loop')
  for i in df_base_name_unique:
    print(i)
     

display_unique_countries()
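As an alternative to `drop_duplicates()`, a Series also has a `unique()` method that returns just the distinct values in the order they first appear. A minimal sketch on a few invented country names:

```python
import pandas as pd

# Hypothetical sample of the base_country_name column
df = pd.DataFrame({
    'base_country_name': ['Bahrain', 'India', 'Bahrain', 'Chile', 'India'],
})

# Distinct values only, in order of first appearance
unique_names = df['base_country_name'].unique()

for name in unique_names:
    print(name)
```

The result is an array of values rather than a Series, so it carries no row index, which suits a simple for loop.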


### Exercise 6 - how many countries are migrated to

Using the "Country Migration" sheet again, get all the unique country names to where people have migrated and display the number of unique countries.

In [None]:
import pandas as pd

def display_migration_to_unique_countries():
  excel_url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
  df = pd.read_excel(excel_url, sheet_name='Country Migration')
    
  # Set some display options when run
  pd.options.display.max_rows= 10
  
  col = 'target_country_name'
  # Find unique target_country_name
  # Get all rows from target_country_name key
  df_base_name = df[col]
  print(f'Total number of entries in {col} = {df_base_name.shape}')

  
  # Get all unique values in rows
  df_base_name_unique = df_base_name.drop_duplicates()
  print(f'Total number of unique entries in {col} = {df_base_name_unique.shape}')
    
  print(f'Displaying unique country names using for loop')
  for i in df_base_name_unique:
    print(i)
     

display_migration_to_unique_countries()
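When only the number of unique values is needed, `Series.nunique()` gives it directly. One difference worth knowing is that `nunique()` ignores missing values by default, whereas `drop_duplicates()` keeps one missing entry. The sketch below uses invented names with one deliberately missing value to show the effect.

```python
import pandas as pd

# Hypothetical sample of target_country_name, including one missing value
names = pd.Series(
    ['Canada', 'Germany', None, 'Canada', 'Germany', 'Japan'],
    name='target_country_name',
)

count_without_missing = names.nunique()           # missing values are ignored
count_with_missing = names.nunique(dropna=False)  # missing counts as one value
count_via_drop = len(names.drop_duplicates())     # also keeps one missing entry

print(count_without_missing, count_with_missing, count_via_drop)
```

If the column has no missing values, all three counts agree.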

### Exercise 7 - skill sets
---


Using the "Skill Migration" data sheet, display the following:
*  the first record
*  the last 10 records
*  records from index 500 to index 600  
*  a count of all unique countries
*  a list of all unique skill group categories
*  the first 50 skill group names and net migration in 2019

In [None]:
import pandas as pd

def display_skill_migration_data_samples():
  excel_url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
  user_sheet_name = 'Skill Migration'
  df = pd.read_excel(excel_url, sheet_name=user_sheet_name)

  # Print shape to just get an idea about samples
  print(f'Shape of {user_sheet_name} is {df.shape}')

  total_col = len(df.axes[1])
  # Set some display options when run
  # max_rows must be at least 101, otherwise records 500 to 600 are truncated when printed
  pd.options.display.max_columns= total_col
  pd.options.display.max_rows= 101

  # Print the first record
  print(f'First record {df.head(1)}')

  # Print last 10 records
  print(f'Last 10 records {df.tail(10)}')

  # Print records [500:600]
  # iloc excludes the end position, so 601 is needed to include record 600 (101 rows)
  print(f'Printing [500:600] \n {df.iloc[500:601]}')

  # a count of all unique countries
  col = 'country_name'
  df_country_name = df.drop_duplicates(subset = [col])
  count = len(df_country_name.axes[0])
  print(f'Unique country_name shape {df_country_name.shape} and total count {count}')

  # a list of all unique skill group categories
  col = 'skill_group_category'
  df_skill_grp = df.drop_duplicates(subset = [col])
  count = len(df_skill_grp.axes[0])
  print(f'Unique skill group count is {count}')
  print(list(df_skill_grp[col]))
  
  # the first 50 skill group names and net migration in 2019
  col = ['skill_group_name', 'net_per_10K_2019']
  df_skill_grp_net_mig_2019 = df[col]
  # Just to get an idea
  print(df_skill_grp_net_mig_2019.shape)
  print(df_skill_grp_net_mig_2019.head(50))
  

display_skill_migration_data_samples()
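The reason some rows can go missing when printing records 500 to 600 is that the slice holds 101 rows, and pandas shortens any printout with more rows than `display.max_rows`. The sketch below reproduces this on an invented 1000-row dataframe, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical frame with more rows than the display limits used below
df = pd.DataFrame({'value': range(1000)})
subset = df.iloc[500:601]    # positions 500..600 inclusive, so 101 rows

with pd.option_context('display.max_rows', 100):
    truncated = repr(subset)  # 101 rows exceeds the limit, so the middle is elided

with pd.option_context('display.max_rows', 101):
    complete = repr(subset)   # the limit now covers every row

print(len(truncated.splitlines()), len(complete.splitlines()))
```

Raising the limit, or printing `subset.to_string()`, shows every row.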

### Exercise 8 - total net migration over 5 years
---

Using the Country Migration data sheet, display the record at position 10.  Using the column names to access the data in each column, add the five columns containing the net migration data together to calculate the total net migration over the 5 years between 2015 and 2019.

In [None]:
import pandas as pd

def net_migration_in_5yrs():
  excel_url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
  user_sheet_name = 'Country Migration'
  df = pd.read_excel(excel_url, sheet_name=user_sheet_name)

  # Print shape to just get an idea about samples
  print(f'Shape of {user_sheet_name} is {df.shape}')

  total_col = len(df.axes[1])
  # Set some display options when run
  pd.options.display.max_columns= total_col
  pd.options.display.max_rows= 100

  # display the record at position 10
  record = df.iloc[10]
  print(record)

  # calculate the total net migration for this record over the 5 years between 2015 and 2019,
  # using the column names to access each value
  net_migration_2015_to_2019 = (record['net_per_10K_2015'] + record['net_per_10K_2016']
                                + record['net_per_10K_2017'] + record['net_per_10K_2018']
                                + record['net_per_10K_2019'])
  print(f'net_migration_2015_to_2019 is {net_migration_2015_to_2019}')


net_migration_in_5yrs()
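It helps to separate two different totals here: the sum of one column over all rows, and the sum across the five yearly columns for a single record, which is what this exercise asks for. The sketch below shows both on two invented rows shaped like the "Country Migration" sheet. The helper name `year_cols` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the Country Migration sheet
df = pd.DataFrame({
    'base_country_name': ['A', 'B'],
    'target_country_name': ['C', 'D'],
    'net_per_10K_2015': [0.5, -1.0],
    'net_per_10K_2016': [1.0, 2.0],
    'net_per_10K_2017': [1.5, 0.5],
    'net_per_10K_2018': [2.0, 0.25],
    'net_per_10K_2019': [2.5, 1.25],
})

year_cols = [f'net_per_10K_{year}' for year in range(2015, 2020)]

# Total for ONE record: pick the row, then sum its five yearly values
record = df.iloc[1]
record_total = record[year_cols].sum()

# Total for ONE column: sums that year over every row in the sheet
column_total_2015 = df['net_per_10K_2015'].sum()

print(record_total, column_total_2015)
```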

### Exercise 9 - formulating a text response

Using the Industry Migration data sheet, display the record at position 100.  Using the column names again to access the data in each column, calculate net migration over the 5 years from 2015 to 2019.

Then print the following message, adding the data where indicated:

The total net migration into `country` of skilled workers for the `industry name` industry, during the years 2015-2019 was `total_net_migration`

In [None]:
import pandas as pd

def sum_net_migration(df):
  # calculate the total net migration over the 5 years between 2015 and 2019
  net_migration_in_2015 = df.net_per_10K_2015
  print(f'net_migration_in_2015 = {net_migration_in_2015}')
  
  net_migration_in_2016 = df.net_per_10K_2016
  print(f'net_migration_in_2016 = {net_migration_in_2016}')

  net_migration_in_2017 = df.net_per_10K_2017
  print(f'net_migration_in_2017 = {net_migration_in_2017}')

  net_migration_in_2018 = df.net_per_10K_2018
  print(f'net_migration_in_2018 = {net_migration_in_2018}')

  net_migration_in_2019 = df.net_per_10K_2019
  print(f'net_migration_in_2019 = {net_migration_in_2019}')

  net_migration_2015_to_2019 = net_migration_in_2015 + net_migration_in_2016 + net_migration_in_2017 + net_migration_in_2018 + net_migration_in_2019
  print(f'net_migration_2015_to_2019 is {net_migration_2015_to_2019}')
  
  return net_migration_2015_to_2019


def net_industry_migration_in_5yrs():
  excel_url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
  user_sheet_name = 'Industry Migration'
  df = pd.read_excel(excel_url, sheet_name=user_sheet_name)

  # Print shape to just get an idea about samples
  print(f'Shape of {user_sheet_name} is {df.shape}')

  total_col = len(df.axes[1])
  # Set some display options when run
  pd.options.display.max_columns= total_col
  pd.options.display.max_rows= 100

  # display the record at position 100
  df_100 = df.iloc[100]
  print(df_100)
  
  country_name = df_100['country_name']
  industry_name = df_100['industry_name'] 
  # calculate net migration for record 100 over the 5 years from 2015 to 2019
  # (the exercise asks for the total of this single record, not a sum over all rows)
  net_migration = sum_net_migration(df_100)

  print(f'The total net migration into {country_name} of skilled workers for the {industry_name} industry, during the years 2015-2019 was {net_migration}')
  
net_industry_migration_in_5yrs()


### Exercise 10 - five text responses

Using the Industry Migration data sheet, calculate net migration (rounded to 1 decimal place) over the 5 years from 2015 to 2019 for the records at positions 1 to 5 and print the following message for each, adding the data where indicated:

The total net migration into `country` of skilled workers for the `industry name` industry, during the years 2015-2019 was `total_net_migration`

*Hint: use a for loop to get a record and do each calculation*

In [None]:
import pandas as pd

def sum_net_migration(df):
  # calculate the total net migration over the 5 years between 2015 and 2019
  
  net_migration_2015_to_2019 = df.net_per_10K_2015 + df.net_per_10K_2016 + df.net_per_10K_2017 + df.net_per_10K_2018 + df.net_per_10K_2019
  #print(f'net_migration_2015_to_2019 is {net_migration_2015_to_2019}')
  
  return net_migration_2015_to_2019

def industry_migration_for_top_5_records():
  excel_url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
  user_sheet_name = 'Industry Migration'
  df = pd.read_excel(excel_url, sheet_name=user_sheet_name)

  # Print shape to just get an idea about samples
  print(f'Shape of {user_sheet_name} is {df.shape}')

  # First five records (positions 0 to 4); use range(1, 6) if positions 1 to 5 are meant literally
  for i in range(5):
    df_i = df.iloc[i]
    #print(df_i)

    country_name = df_i['country_name']
    industry_name = df_i['industry_name'] 
    net_migration = round(sum_net_migration(df_i), 1)

    print(f'The total net migration into {country_name} of skilled workers for the {industry_name} industry, during the years 2015-2019 was {net_migration}')
  
industry_migration_for_top_5_records()
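The same result can also be reached by computing the five-year total for every row in one step and then looping only to print. This is a sketch of that alternative, not a replacement for the loop the hint asks for. The two rows are invented, and the column name `total_net_migration` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the Industry Migration sheet
df = pd.DataFrame({
    'country_name': ['Country A', 'Country B'],
    'industry_name': ['Industry X', 'Industry Y'],
    'net_per_10K_2015': [1.25, -3.0],
    'net_per_10K_2016': [2.5, 1.5],
    'net_per_10K_2017': [0.25, 0.26],
    'net_per_10K_2018': [1.0, 0.0],
    'net_per_10K_2019': [0.13, 0.0],
})

year_cols = [f'net_per_10K_{year}' for year in range(2015, 2020)]

# axis=1 sums across the columns of each row; round(1) keeps 1 decimal place
df['total_net_migration'] = df[year_cols].sum(axis=1).round(1)

messages = []
for row in df.head(5).itertuples(index=False):
    messages.append(
        f'The total net migration into {row.country_name} of skilled workers '
        f'for the {row.industry_name} industry, during the years 2015-2019 '
        f'was {row.total_net_migration}'
    )

for message in messages:
    print(message)
```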