# Data Munging: Strings

Data munging, the process of wrestling with data to make it into something clean and usable, is an important part of any job analyzing data.

Today we're going to focus on some data that has information we want, but the information is not properly *structured*. In particular, it comes as a single column with a string value, and we want to turn it into a set of boolean columns.

To do that, we're going to use the powerful built-in methods Python provides for working with strings. You can read all about the available methods here:

https://docs.python.org/3/library/stdtypes.html#string-methods

In particular, we're going to use `.split()`, which is a method that turns a string into a list of strings, and `.strip()`, which removes the leading and trailing "whitespace" from a string.
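
As a quick illustration of the two methods, here is a minimal sketch. The string below is made up for the example and is not taken from the data; the comma separator is likewise an assumption.

```python
# A made-up string in the style of a keyword list (not taken from jobs.csv).
raw = " Part-Time , Contract "

# split(",") cuts the string at every comma and returns a list of pieces.
pieces = raw.split(",")
print(pieces)  # [' Part-Time ', ' Contract ']

# strip() returns a copy without leading and trailing whitespace.
keywords = [piece.strip() for piece in pieces]
print(keywords)  # ['Part-Time', 'Contract']

# Called with no argument, split() cuts on runs of whitespace instead.
print("  a  b ".split())  # ['a', 'b']

# Splitting an empty string on a separator yields one empty string,
# not an empty list.
print("".split(","))  # ['']
```

Note that `.strip()` leaves whitespace in the middle of a string untouched, and that both methods return new objects rather than changing the original string.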

In [None]:
# Play:
#
# Take a look at the official Python documentation for the
# "split" and "strip" methods. Play around with them now
# to make sure you understand how they work:



In [None]:
#
# 1) 
# Read the data from a CSV file called "jobs.csv" into a DataFrame.
# This data is from a site that posts job ads online. 
# Each row represents an ad for a job on the site.


In [None]:
# 
# Take a look at your data and note that you have
# a column called `pay`. That column is a string,
# as far as Python is concerned. However, we
# humans can see that the information is more
# structured than that. It seems like a "collection
# of keywords," where each job can have zero or more
# keywords such as "Part-Time" or "Contract" which
# describe the type of contract.
#
# There are 6 different contract types. 
# 
# Your goal:
# Transform the DataFrame, adding 6 boolean columns, 
# one for each contract type, indicating whether or
# not that job has that contract type.
#
# NOTE: This is a relatively large task. 
# Break it down into a series of steps, just like
# we did in the last exercises. Work on each
# step separately.
#
# Many of the steps will require you to work with
# the string methods mentioned above.
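
To make the goal concrete for a single row, here is a rough sketch in plain Python. The keyword names, the `pay` value and the comma separator are all assumptions made for illustration; check what the real column looks like before relying on any of them.

```python
# Hypothetical values; the real keywords and separator in jobs.csv may differ.
contract_types = ["Part-Time", "Contract", "Temporary"]
pay = "Part-Time, Contract"

# Split on the assumed separator and strip each piece.
keywords = [piece.strip() for piece in pay.split(",")]

# One boolean per contract type for this single row.
flags = {contract_type: contract_type in keywords for contract_type in contract_types}
print(flags)  # {'Part-Time': True, 'Contract': True, 'Temporary': False}
```

Testing membership in the list of split keywords, rather than searching the raw string, avoids accidental matches when one keyword happens to be contained in another.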

In [None]:
#
# 2)
# Break down your tasks, write a "pipeline" function
# called "add_contract_types".
#
# HINT: Last time, each "step" returned a DataFrame
# object. That might not be the case this time: the
# steps can return any data type that helps you
# move to the next step!



In [None]:
#
# 3) 
# Now write all the "steps" (functions) needed
# by your pipeline function (add_contract_types)


In [None]:
# 
# 4)
# Now add the needed columns by using your function
# add_contract_types. You will want the returned
# DataFrame for some of the further exercises.


In [None]:
#
# 5) 
# Assume that all jobs that don't specify a contract
# type in "pay" are Full-time. Create a new column, 
# called "Full-time", which is a boolean that 
# should be True if the job is Full-time, False otherwise.
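
A minimal sketch of one way to do this, assuming the boolean contract-type columns already exist. The toy frame and its two column names are invented for the example.

```python
import pandas as pd

# Toy frame with two hypothetical flag columns already added.
toy = pd.DataFrame({
    "Part-Time": [True, False, False],
    "Contract": [True, False, True],
})
type_columns = ["Part-Time", "Contract"]

# any(axis=1) is True for rows with at least one flag set; ~ negates it.
toy["Full-time"] = ~toy[type_columns].any(axis=1)
print(toy["Full-time"].tolist())  # [False, True, False]
```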


In [None]:
#
# 6)
# Get the percentage of jobs for each contract type
# i.e. number of jobs of X type / number of jobs
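
Because `True` counts as 1 and `False` as 0, the mean of a boolean column is exactly "number of jobs of X type / number of jobs". A small sketch on invented data:

```python
import pandas as pd

# Toy flag columns, invented for illustration.
toy = pd.DataFrame({
    "Part-Time": [True, False, False, True],
    "Contract": [True, False, False, False],
})

# The mean of a boolean column is the fraction of True values.
percentages = toy[["Part-Time", "Contract"]].mean() * 100
print(percentages)
```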


In [None]:
# 
# 7)
# Which industries ('category') have the highest
# percentage of part-time jobs posted?
# The lowest?


In [None]:
#
# 8)
# Which industries ('category') have the highest
# percentage of Internship jobs posted?
# The lowest?

# NOTE: This question is very similar to the last one.
# Write a function that can answer both questions.
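
Such a function might look roughly like the sketch below. The name `percentage_by_category` and the toy data are hypothetical; only the `category` column name comes from the exercise.

```python
import pandas as pd

def percentage_by_category(df, flag_column):
    """Percentage of rows per category where flag_column is True, highest first."""
    shares = df.groupby("category")[flag_column].mean() * 100
    return shares.sort_values(ascending=False)

# Toy data invented for illustration.
toy = pd.DataFrame({
    "category": ["IT", "IT", "Retail", "Retail", "Retail", "Health"],
    "Part-Time": [False, False, True, True, False, True],
})
ranked = percentage_by_category(toy, "Part-Time")
print(ranked.index[0], ranked.index[-1])  # Health IT
```

Passing the column name as a parameter is what lets the same function answer the question for any contract type.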


In [None]:
#
# 9)
# Use your function to ask the same question about
# Commission jobs


In [None]:
#
# 10)
# Let's call jobs that are either Temporary, 
# Part-time or Internships "precarious". 
#
# Order the industries (category) by the 
# percentage of precarious jobs
#
# HINT: can you modify some previous function
# to make this question easy to answer?
#
# HINT: Make sure your variables reflect their
# content. Collections should be plural, single
# elements should be singular.
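
One way to read the first hint is to let the per-category function accept several flag columns at once. The sketch below assumes hypothetical column names (`Temporary`, `Part-time`, `Internship`) and toy data; the function name is also made up.

```python
import pandas as pd

def percentage_by_category(df, flag_columns):
    """Percentage of rows per category with at least one of flag_columns True."""
    is_flagged = df[flag_columns].any(axis=1)
    shares = is_flagged.groupby(df["category"]).mean() * 100
    return shares.sort_values(ascending=False)

# Toy data invented for illustration.
toy = pd.DataFrame({
    "category": ["IT", "IT", "Retail", "Retail"],
    "Temporary": [False, False, True, False],
    "Part-time": [False, False, False, True],
    "Internship": [False, True, False, False],
})
precarious_types = ["Temporary", "Part-time", "Internship"]
ranked = percentage_by_category(toy, precarious_types)
print(ranked)
```

A job that matches more than one of the listed types is still counted once, because the flags are combined per row before grouping.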


In [None]:
#
# 11)
# Get the 5 companies that post the most jobs
# in each category, along with the number of
# jobs listed by each company.
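
A possible approach, sketched on toy data: count rows per (category, company) pair, sort within each category, and keep the first few rows of each group. The `company` column name, the `number_of_jobs` label and the function name `top_companies` are assumptions for this example.

```python
import pandas as pd

def top_companies(df, n=5):
    """Return the n companies with the most rows in each category, with counts."""
    counts = (
        df.groupby(["category", "company"])
        .size()
        .reset_index(name="number_of_jobs")
    )
    ordered = counts.sort_values(
        ["category", "number_of_jobs"], ascending=[True, False]
    )
    return ordered.groupby("category").head(n).reset_index(drop=True)

# Toy data invented for illustration; n=1 keeps the output short.
toy = pd.DataFrame({
    "category": ["IT", "IT", "IT", "Retail", "Retail", "Retail"],
    "company": ["A", "A", "B", "B", "C", "C"],
})
top = top_companies(toy, n=1)
print(top)
```

This sketch does nothing special about ties: if several companies share a count, `head(n)` simply keeps whichever come first after sorting.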


In [None]:
#
# 12)
# Is any company in the top 5 of more than one category?
# Return the companies that are, along with the categories
# in which they appear in the top 5.
#
# FORMAT: DataFrame with 3 columns: company, category, number of jobs
#
# HINT: take a look at the `.filter` method on GroupBy:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html
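
To show what `.filter` on a GroupBy does, here is a small sketch. The table below is a hypothetical stand-in for a top-5 result, with invented companies and counts.

```python
import pandas as pd

# Hypothetical top-companies table with the three requested columns.
top = pd.DataFrame({
    "company": ["A", "B", "A", "C"],
    "category": ["IT", "IT", "Retail", "Retail"],
    "number_of_jobs": [12, 7, 9, 4],
})

# filter keeps every row of the groups for which the function returns True.
repeated = top.groupby("company").filter(
    lambda group: group["category"].nunique() > 1
)
print(repeated)
```

Unlike an aggregation, `.filter` returns the original rows of the groups that pass the test, so the result keeps the same three columns.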
