# Introduction
- Data wrangling is the process of gathering your data, assessing its quality and structure, and cleaning it before you do things like analysis, visualization, or building predictive models using machine learning

# Course Outline
1. Lesson 1: The Walkthrough
2. Lessons 2-4: Gathering, Assessing, and Cleaning Data (in detail)

# Why wrangle data?
- Data is being produced in large amounts, so data savviness will only become more important in the future

# Data Wrangling Examples
- Data wrangling is serious business. Skipping it can have major consequences
    * __Financial Analyst__ - If you are creating models to make million-dollar trades, your data had better be clean or you'll go broke
    * __Drug Company Scientist__ - If your company is about to start human trials for a life-saving new drug and you need to determine the right dosage for humans based on your lab and animal tests, your data needs to be clean or your drug might not work and you could seriously hurt people

# Walkthrough and dataset
- We start with brief introductions to gathering, assessing, and cleaning data, which are the three core steps in the data wrangling process
- The dataset to be wrangled is a [dataset of 19,000 online job posts from 2004 to 2015](https://www.kaggle.com/datasets/udacity/armenian-online-job-postings) that were posted through an Armenian human resource portal
- The dataset is dirty and messy enough that you'll have wrangling work to do, but also clean enough that it won't give you a headache.

## Gather(Intro)
__What is data gathering?__ 
- Gathering is sometimes called acquiring or collecting data.
- Data sources: **files, databases, data scraped off a website, APIs**

## Gather(Download)
- Downloading can be done manually by clicking the download button, or sometimes by right-clicking a link and choosing "Save file as"
- Best practice is to download files programmatically for **scalability** and **reproducibility**

1. Scalability - The ability of a process to handle an increasing scope of work. Imagine you had a thousand files to download from a thousand different web pages instead of just one. It would take an eternity to point and click a thousand times. A few lines of code can do the same job
2. Reproducibility - The ability of a process to produce the same results from identical inputs. Someone other than yourself will want to run your analysis later, so make downloading the datasets as easy for that person as possible
- Reproducibility is also one of the main principles of the [scientific method](https://en.wikipedia.org/wiki/Scientific_method#Documentation_and_replication). You want to be able to prove to people that your analysis and visualizations are legitimate. In addition, the dataset may change on the web page where it lives
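
The scalability and reproducibility points above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the course's own code: the helper names `download_file` and `download_all`, the `data` folder, and the skip-if-already-downloaded behaviour are all assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen
import shutil


def download_file(url, dest_dir="data", overwrite=False):
    """Download one URL into dest_dir and return the local path (hypothetical helper)."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Name the local file after the last path segment of the URL.
    filename = Path(urlparse(url).path).name
    if not filename:
        raise ValueError(f"cannot derive a file name from {url!r}")
    target = dest_dir / filename

    # Keep an existing copy unless asked to overwrite, so reruns are cheap
    # and the saved file stays fixed even if the web page changes later.
    if target.exists() and not overwrite:
        return target

    with urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


def download_all(urls, dest_dir="data"):
    """Download every URL in the list; the same loop serves 1 file or 1000."""
    return [download_file(url, dest_dir) for url in urls]
```

Calling `download_all(list_of_urls)` with your own list of links would then fetch every file into the `data` folder. The Kaggle page linked above may require signing in, so a plain URL download is not guaranteed to work for this particular dataset.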

## Gather(unzip file)
- Using code to unzip files makes your wrangling work more reproducible than using an external program or clicking to unzip the file
- Import the `zipfile` library
- `zipfile.ZipFile` is the class for reading and writing zip files

In [1]:
import zipfile

In [3]:
# extracting all contents from a zipfile
with zipfile.ZipFile('archive.zip', 'r') as data:
    data.extractall()
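
As a slightly more defensive variant of the extraction above, the sketch below checks the archive, lists its members, and extracts them into a chosen folder. The function name `extract_archive` and the `dest_dir` parameter are assumptions for illustration; `testzip()`, `namelist()`, and `extractall()` are methods of `zipfile.ZipFile`.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Check a zip archive, extract it into dest_dir, and return the member names."""
    dest_dir = Path(dest_dir)
    with zipfile.ZipFile(zip_path, "r") as archive:
        # testzip() returns the name of the first corrupt member, or None.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad_member}")
        # namelist() shows what is inside before anything is written to disk.
        names = archive.namelist()
        archive.extractall(dest_dir)
    return names
```

Returning the member names makes it easy to confirm which file to pass to the import step that follows.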

## Gather: Import

In [4]:
import pandas as pd

In [5]:
df = pd.read_csv("online-job-postings.csv")

In [6]:
df.head()

| | jobpost | date | Title | Company | AnnouncementCode | Term | Eligibility | Audience | StartDate | Duration | ... | Salary | ApplicationP | OpeningDate | Deadline | Notes | AboutC | Attach | Year | Month | IT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AMERIA Investment Consulting Company\r\nJOB TI... | Jan 5, 2004 | Chief Financial Officer | AMERIA Investment Consulting Company | | | | | | | ... | | To apply for this position, please submit a\r\... | | 26 January 2004 | | | | 2004 | 1 | False |
| 1 | International Research & Exchanges Board (IREX... | Jan 7, 2004 | Full-time Community Connections Intern (paid i... | International Research & Exchanges Board (IREX) | | | | | | 3 months | ... | | Please submit a cover letter and resume to:\r\... | | 12 January 2004 | | The International Research & Exchanges Board (... | | 2004 | 1 | False |
| 2 | Caucasus Environmental NGO Network (CENN)\r\nJ... | Jan 7, 2004 | Country Coordinator | Caucasus Environmental NGO Network (CENN) | | | | | | Renewable annual contract\r\nPOSITION | ... | | Please send resume or CV toursula.kazarian@...... | | 20 January 2004\r\nSTART DATE: February 2004 | | The Caucasus Environmental NGO Network is a\r\... | | 2004 | 1 | False |
| 3 | Manoff Group\r\nJOB TITLE: BCC Specialist\r\n... | Jan 7, 2004 | BCC Specialist | Manoff Group | | | | | | | ... | | Please send cover letter and resume to Amy\r\n... | | 23 January 2004\r\nSTART DATE: Immediate | | | | 2004 | 1 | False |
| 4 | Yerevan Brandy Company\r\nJOB TITLE: Software... | Jan 10, 2004 | Software Developer | Yerevan Brandy Company | | | | | | | ... | | Successful candidates should submit\r\n- CV; \... | | 20 January 2004, 18:00 | | | | 2004 | 1 | True |
