# Introduction
- Data wrangling is the process of gathering your data, assessing its quality and structure, and cleaning it before you do things like analysis, visualization or building predictive models using machine learning

# Course Outline
1. Lesson 1: The Walkthrough
2. Lesson 2-4: Gathering, Assessing and Cleaning Data (in detail)

# Why wrangle data?
- Data is being produced in large amounts so data savviness will only become more and more important in the future

# Data Wrangling Examples
- Data wrangling is serious business. Skipping it can have serious consequences
    * __Financial Analyst__ - If you are creating models to make million dollar trades, your data better be clean or you'll go broke
    * __Drug Company Scientist__ - If your company is about to start human trials for a life saving new drug and you need to determine the right dosage for humans based on your lab and animal tests, your data needs to be clean or your drug might not work and you could seriously hurt people

# Walkthrough and dataset
- We start with brief introductions to gathering, then assessing and then cleaning our data, which are the three core steps in the data wrangling process
- The dataset to be wrangled is a [dataset of 19000 online job posts from 2004 to 2015](https://www.kaggle.com/datasets/udacity/armenian-online-job-postings) that were posted through an Armenian human resource portal
- The dataset is dirty and messy enough that you'll have wrangling work to do but also clean enough that it won't give you a headache.

## Gather(Intro)
__What is data gathering?__ 
- Gathering is sometimes called acquiring or collecting data.
- Data sources: **files, databases, data scraped off a website, APIs**

## Gather(Download)
- Downloading can be done manually by clicking the download button or sometimes right clicking on a link and clicking "Save file as"
- Best practice is to download files programmatically for **scalability** and **reproducibility**

1. Scalability - The ability of a process to handle an increasing scope of work. Imagine you had a thousand files to download from a thousand different web pages instead of just one. It would take an eternity to point and click a thousand times. You can do the same with a few lines of code
2. Reproducibility - The ability of a process to produce the same results from identical inputs. Someone other than yourself will want to run your analysis later, so make downloading the datasets as easy for that person as possible
- Reproducibility is also one of the main principles of the [scientific method](https://en.wikipedia.org/wiki/Scientific_method#Documentation_and_replication). You want to be able to show people that your analysis and visualizations are legitimate, and the dataset may change on the web page where it lives
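
The idea of downloading with code instead of clicking can be sketched with the standard library alone. This is only an illustration: `download_file` is a hypothetical helper, and it assumes a direct, publicly reachable file URL (a page that requires signing in, which may be the case for the Kaggle link above, would need a different approach).

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest_dir="."):
    """Download `url` into `dest_dir` and return the local path (hypothetical helper)."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Use the last segment of the URL as the local file name
    filename = url.rstrip("/").split("/")[-1]
    dest = dest_dir / filename
    # Stream the response to disk instead of loading it all into memory
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Looping this function over a list of URLs is what makes the process scalable, and keeping the URLs in the notebook is what makes it reproducible.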

## Gather(unzip file)
- Using code to unzip files makes your wrangling work more reproducible than using an external program or clicking and unzipping the file
- Import the `zipfile` library
- `zipfile.ZipFile` is the class for reading and writing zip files

In [1]:
import zipfile

In [3]:
# extracting all contents from a zipfile
with zipfile.ZipFile('archive.zip', 'r') as data:
    data.extractall()
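
As a small extension of the cell above, it can help to look at what an archive contains before extracting it, and to extract into a dedicated folder. The sketch below builds a tiny stand-in archive in a temporary directory so it runs on its own; the file contents and the `data` folder name are made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path

# Build a tiny archive so the example is self-contained (stand-in for 'archive.zip')
workdir = Path(tempfile.mkdtemp())
archive_path = workdir / "archive.zip"
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("online-job-postings.csv", "Title,Year\nSoftware Developer,2004\n")

# List the archive's members first, then extract them into a dedicated folder
with zipfile.ZipFile(archive_path, "r") as archive:
    members = archive.namelist()
    archive.extractall(workdir / "data")

extracted = workdir / "data" / "online-job-postings.csv"
print(members)
print(extracted.exists())
```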

## Gather: Import

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("online-job-postings.csv")

# Assess
### Intro
#### Data Quality
- Low quality data is commonly referred to as dirty data. Dirty data has issues with its content.
- Data quality is an assessment or a perception of data's fitness to serve its purpose in a given context
- There are no hard and fast rules for data quality. One dataset may be of high enough quality for one application but not for another
- Common data quality issues include:
    * missing data
    * invalid data - a cell having an impossible value, e.g. -50 inches as someone's height, or the unit "inches" written in the height column, which turns the values into strings
    * inaccurate data
    * inconsistent data - e.g. using different units for height
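
The height example above can be checked with a few lines of pandas. This is a sketch on a made-up table (the names and values are hypothetical), not part of the job postings dataset.

```python
import pandas as pd

# Toy data: one valid height, one impossible value, one missing, one with a unit
people = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal", "Dee"],
    "height": ["64", "-50", None, "70 inches"],
})

# Missing data: cells with no value at all
missing = people["height"].isna()

# Values that cannot be parsed as numbers become NaN with errors="coerce"
as_number = pd.to_numeric(people["height"], errors="coerce")
not_numeric = as_number.isna() & ~missing

# Invalid data: a height below zero is impossible
impossible = as_number < 0

print("missing:", int(missing.sum()))
print("not numeric:", int(not_numeric.sum()))
print("impossible:", int(impossible.sum()))
```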

#### Tidiness
- Untidy data is commonly referred to as messy data. Messy data has issues with its structure
- A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
    * Each variable forms a column
    * Each observation forms a row
    * Each type of observational unit forms a table
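
To make the three rules concrete, here is a minimal sketch with a made-up table in which the variable "year" is spread across column headers. Reshaping it with `melt` gives one column per variable and one row per observation. The company names and counts are hypothetical.

```python
import pandas as pd

# Messy: the values of the variable "year" are used as column headers
messy = pd.DataFrame({
    "company": ["A", "B"],
    "2014": [10, 7],
    "2015": [12, 9],
})

# Tidy: each variable (company, year, job_posts) forms a column,
# and each (company, year) observation forms a row
tidy = messy.melt(id_vars="company", var_name="year", value_name="job_posts")
print(tidy)
```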

### Types of Assessment
1. __visual assessment__ - opening your data in your favourite software application and scrolling through it, looking for quality and tidiness issues
2. __programmatic assessment__ - using code to assess the data, which tends to be more efficient than visual assessment. One simple example is the pandas `info` method, which gives basic information about the DataFrame such as the number of entries, the number of columns, the type of each column, whether there are missing values and more. Another example is pandas' plotting capabilities through the `plot` method

## Visually inspecting the data

In [7]:
# visually inspecting the data
df

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18996,Technolinguistics NGO\r\n\r\n\r\nTITLE: Senio...,"Dec 28, 2015",Senior Creative UX/ UI Designer,Technolinguistics NGO,,Full-time,,,,Long-term,...,Competitive,"To apply for this position, please send your\r...",29 December 2015,28 January 2016,,As a company Technolinguistics has a mandate t...,,2015,12,False
18997,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Category Development Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,...,,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18998,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Operational Marketing Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,...,,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18999,San Lazzaro LLC\r\n\r\n\r\nTITLE: Head of O...,"Dec 30, 2015",Head of Online Sales Department,San Lazzaro LLC,,,,,,Long-term,...,Highly competitive,Interested candidates can send their CVs to:\r...,30 December 2015,29 January 2016,,San Lazzaro LLC works with several internation...,,2015,12,False


* Missing values (NaN)
* StartDate inconsistencies

- We've got 19001 rows and 24 columns

#### What needs to be fixed?
1. __Missing Records__
- Missing records are a common issue. _pandas_ represents them as `NaN`, which stands for "Not a Number"

2. __Multiple terms that mean the same thing__
- In the **StartDate** column, "ASAP", "As soon as possible" and "immediately" all mean the same thing. Ideally the same term should be used in all cases
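
One way to spot such variants with code is sketched below on a toy Series. The values, and the set of terms treated as equivalent, are assumptions for illustration rather than an exhaustive list from the dataset.

```python
import pandas as pd

# Toy stand-in for the StartDate column
start_dates = pd.Series(
    ["ASAP", "As soon as possible", "immediately", "ASAP", "01 March 2015"]
)

# How often does each distinct spelling occur?
counts = start_dates.value_counts()
print(counts)

# Assumed set of terms that all mean "as soon as possible" (lower-cased)
same_meaning = {"asap", "as soon as possible", "immediately", "immediate"}
flagged = start_dates[start_dates.str.lower().isin(same_meaning)]

# Number of different spellings used for the same idea
n_variants = flagged.nunique()
print(n_variants)
```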

#### What doesn't need to be fixed
1. __Escape Characters__
- The backslashes followed by letters (such as `\r\n`) are called escape sequences. They help structure text on web pages using new lines and special characters
- We want to preserve the structure of the original posting, so we won't need to fix this in the **Clean** step
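
A short sketch of what these characters do, using a shortened string modeled on one of the postings above: `repr` shows the escape sequences as they appear in the table, while `print` renders them as line breaks.

```python
# A shortened string modeled on a job posting; \r\n is a carriage return plus line feed
text = "Manoff Group\r\nJOB TITLE: BCC Specialist"

# repr() shows the escape sequences literally, as in the DataFrame display
print(repr(text))

# print() interprets them, so the text appears on two lines
print(text)

# splitlines() treats \r\n as a single line break
lines = text.splitlines()
print(lines)
```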

- It's good practice to document all your assessments before starting the **Clean** step, so you can refer back to them easily instead of scrolling around

## Programmatic Assessment
##### The Downside of visual assessment
- If we scroll through our DataFrame a bit more, we'll notice that pandas automatically collapses some columns and rows of larger datasets. It does this because viewing large datasets takes up too much space. This makes visual assessment difficult in pandas
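
If you do want to see everything, the display limits can be lifted temporarily. The sketch below uses `pd.option_context` on a made-up wide frame; the limits chosen here (`None` for columns, 100 rows) are arbitrary, and the previous settings are restored when the `with` block ends.

```python
import pandas as pd

# A made-up wide frame: one row, 30 columns
wide = pd.DataFrame([list(range(30))], columns=[f"col_{i}" for i in range(30)])

default_max_columns = pd.get_option("display.max_columns")

# Inside the block, pandas does not drop columns from the display
with pd.option_context("display.max_columns", None, "display.max_rows", 100):
    inside_max_columns = pd.get_option("display.max_columns")
    print(wide)

# Outside the block, the earlier setting is back in effect
restored_max_columns = pd.get_option("display.max_columns")
```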

##### Programmatic assessment in pandas
- We use the pandas `info` method to get a concise summary of our dataset
- We get information including the number of entries, the names and datatypes of the columns, and the total memory usage
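
A minimal sketch of this kind of summary on a made-up three-row frame (the rows are illustrative, not taken from the dataset). `info` reports non-null counts per column, and `isna().sum()` turns the same idea into a count of missing values.

```python
import io

import pandas as pd

# Toy frame with a few missing values
jobs = pd.DataFrame({
    "Title": ["Chief Financial Officer", "Country Coordinator", None],
    "StartDate": [None, None, "ASAP"],
    "Year": [2004, 2004, 2015],
})

# info() prints to stdout by default; a buffer lets us keep the text
buffer = io.StringIO()
jobs.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# Number of missing values in each column
missing_per_column = jobs.isna().sum()
print(missing_per_column)
```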

- Nondescriptive column headers (ApplicationP, AboutC, RequiredQual)

### Observations about our dataset 
- Some of the column names are problematic
- What are `ApplicationP`, `AboutC` and `RequiredQual`? One of the fundamentals of quality data is having descriptive variable and value names
- `JobRequirment` is descriptive enough but has a typo (it is missing an "e")
- Assessments should not include verbs like "__Fix__ nondescriptive column names". They should only be observations of issues with the data, e.g. "nondescriptive column headers". Verbs come into play when defining cleaning operations

## Explore common Programmatic Assessments
- It's another area where we use code to help detect problems that are not easily detectable with the human eye
- Four common programmatic assessment methods in pandas: __head__, __tail__, __info__ and __value_counts__

In [4]:
# displaying the first five rows of the dataframe
df.head()

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True


In [5]:
# displaying the last five rows
df.tail()

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
18996,Technolinguistics NGO\r\n\r\n\r\nTITLE: Senio...,"Dec 28, 2015",Senior Creative UX/ UI Designer,Technolinguistics NGO,,Full-time,,,,Long-term,...,Competitive,"To apply for this position, please send your\r...",29 December 2015,28 January 2016,,As a company Technolinguistics has a mandate t...,,2015,12,False
18997,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Category Development Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,...,,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18998,"""Coca-Cola Hellenic Bottling Company Armenia"" ...","Dec 30, 2015",Operational Marketing Manager,"""Coca-Cola Hellenic Bottling Company Armenia"" ...",,Full-time,All interested professionals.,,ASAP,Long-term with a probation period of 3 months.,...,,All interested candidates are kindly requested...,30 December 2015,20 January 2016,,,,2015,12,False
18999,San Lazzaro LLC\r\n\r\n\r\nTITLE: Head of O...,"Dec 30, 2015",Head of Online Sales Department,San Lazzaro LLC,,,,,,Long-term,...,Highly competitive,Interested candidates can send their CVs to:\r...,30 December 2015,29 January 2016,,San Lazzaro LLC works with several internation...,,2015,12,False
19000,"""Kamurj"" UCO CJSC\r\n\r\n\r\nTITLE: Lawyer in...","Dec 30, 2015",Lawyer in Legal Department,"""Kamurj"" UCO CJSC",,Full-time,,,,Indefinite,...,,All qualified applicants are encouraged to\r\n...,30 December 2015,20 January 2016,,"""Kamurj"" UCO CJSC is providing micro and small...",,2015,12,False


In [6]:
# displaying a basic summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19001 entries, 0 to 19000
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   jobpost           19001 non-null  object
 1   date              19001 non-null  object
 2   Title             18973 non-null  object
 3   Company           18994 non-null  object
 4   AnnouncementCode  1208 non-null   object
 5   Term              7676 non-null   object
 6   Eligibility       4930 non-null   object
 7   Audience          640 non-null    object
 8   StartDate         9675 non-null   object
 9   Duration          10798 non-null  object
 10  Location          18969 non-null  object
 11  JobDescription    15109 non-null  object
 12  JobRequirment     16479 non-null  object
 13  RequiredQual      18517 non-null  object
 14  Salary            9622 non-null   object
 15  ApplicationP      18941 non-null  object
 16  OpeningDate       18295 non-null  object
 17  Deadline    

In [10]:
# Display the entry counts for the year column using value_counts
df.Year.value_counts(normalize=True) # contains relative frequencies of the unique values
# df.Year.value_counts()

2012    0.113099
2013    0.105731
2015    0.105731
2014    0.104363
2008    0.093942
2011    0.089311
2007    0.080943
2010    0.079522
2009    0.062681
2005    0.059892
2006    0.058734
2004    0.046050
Name: Year, dtype: float64
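
The difference between the two `value_counts` calls in the cell above can be seen on a tiny made-up Series: without arguments the method returns counts, and with `normalize=True` it returns shares that sum to 1.

```python
import pandas as pd

# Toy stand-in for the Year column
years = pd.Series([2012, 2012, 2013, 2015], name="Year")

# Absolute counts of each unique value, most frequent first
counts = years.value_counts()
print(counts)

# Relative frequencies of each unique value
shares = years.value_counts(normalize=True)
print(shares)
```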

- Missing values (NaN)
- StartDate inconsistencies (ASAP)
- Nondescriptive column headers (ApplicationP, AboutC, RequiredQual) and a misspelled one (JobRequirment)

# Clean