<a href="https://colab.research.google.com/github/hazelkimm/hazelkimm/blob/main/Lab_1_Intro_to_NumPy_and_Pandas_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Python**

<img src='https://drive.google.com/uc?id=17eYO8zCyArTrt81izCWXCg5JvaaCLO46'>

Python is one of the most popular programming languages in the world. It is a general-purpose programming language, which means that it has a wide variety of applications including software development, web development and data science. We are interested in the application of Python in data science.

Python is extremely useful for data science as it makes it easy to read, clean, manipulate, and analyze data. You can also visualize your data using graphs and maps, create a classifier (e.g. for classifying patients based on whether or not they have heart disease), build a recommendation system (e.g. for recommending a new song based on your favorite songs) and more. Therefore, knowing how to do data science in Python is a powerful skill.

## **Terminology**

Let's start by learning some basic Python terminology.

- **Syntax**: Syntax refers to the format of the code you are writing. You must format your code in a specific way to make sure that it's error free.

- **Function**: Functions are commands that take in some input and give you some output. For example, sum() is a function that takes in a collection of numbers as the input and gives you the sum of those numbers as the output. Note that not all functions take in an input (you'll learn more about this later).

- **Argument**: Arguments are the inputs of your function. For instance, in sum([13, 20]), the list [13, 20] is the argument of the function sum().

- **Library**: You can think of libraries as a collection of functions for a specific purpose.
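
As a small illustration of these terms, the sketch below calls the built-in function sum() with one argument and then uses a function from a library. The standard-library math module is chosen here only as an example; it is not otherwise used in this lab.

```python
import math  # a library: a collection of related functions

# sum() is the function; the list [13, 20] is its argument.
total = sum([13, 20])
print(total)  # 33

# math.sqrt is a function that lives inside the math library.
root = math.sqrt(16)
print(root)  # 4.0
```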

## **Writing Code**

For our project, we'll be using Google Colab to write code, so you don't have to install Python on your computer to write and run code. Google Colab allows you to do this in your browser. All you have to do is create a code cell, write your code and press Shift + Enter to run it.


# **Introduction to NumPy**

<img src='https://drive.google.com/uc?id=1yvRF_75z5FbfUWsZwy4QaLhI-Q32oJgH'>

NumPy is a Python library used for managing arrays and performing mathematical operations. We start by importing the NumPy library.

In [None]:
import numpy as np

In the above line of code, we imported the NumPy library. The 'as np' expression in the import statement allows us to name the library so that it's easier to use. To use a function in the library, you have to include 'np.' as the prefix (e.g. np.mean).
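
A minimal sketch of the prefix in use, assuming NumPy is installed (the variable name `average` is only illustrative):

```python
import numpy as np

# np.mean is the mean function from the NumPy library.
average = np.mean([2, 4, 6])
print(average)  # 4.0
```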

However, if you want to use the functions in the library without having to include the prefix, use this import statement:

In [None]:
from numpy import *

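In the above line, we import every name from the numpy module, which saves us the need to use the 'np.' prefix before the NumPy functions. Be careful with this form: it can overwrite existing names, including Python built-in functions that share a name with a NumPy function, so `import numpy as np` is generally the safer choice.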



## **Functions**

Recall that functions are commands that take in certain inputs to generate an output. Here's a diagram that summarizes functions:

<img src='https://drive.google.com/uc?id=1ZNWEGYbdnnUT1aCnqkvUuWytALbBQvC1'>

Methods are a special type of function. A method must be used on a particular object (e.g. a table).

Some functions are built-in while others must be imported. Built-in functions are functions that are always available in Python. You don't have to import any libraries in order to use them. An example is sum().

On the other hand, some functions can only be used after you import certain libraries. An example is NumPy, which contains various data science functions such as np.mean and np.array. You cannot use these functions without first importing the NumPy library as they are not built-in.



## **Defining Functions**

Python has tons of functions but sometimes, you might want to create your own function to perform a specific task. Python allows you to define your own functions using the def keyword.

This is the syntax for defining a function:

```
def function_name(arguments):
  '''description'''
  expression
  return output
```

Note that the description is optional but you can include it to let other people know what your function does.

The print() function is used for displaying output. However, when we define a function, we usually use the return statement to send the result back to the code that called the function, so that it can be stored in a variable or used in further calculations. The return statement is typically placed at the end of a function. If your function only needs to display its results, you can print them and leave out return; in that case the function returns None (see the following example).


Say you have a list of grades and you want to know the percentage of each grade (A, B and C) in the list. You can make this process simple by defining a function that takes a list of grades as its input and gives you the percentages of each of the 3 grades as the output.

In [None]:
#Defining the function

def grade_percentages(grades_list):
  '''Function that calculates the percentage of each grade in a list of grades.'''
  
  counter_a = 0
  counter_b = 0
  counter_c = 0
  other = 0

  for grade in grades_list:
    if grade == 'a':
      counter_a = counter_a + 1
    elif grade == 'b':
      counter_b = counter_b + 1
    elif grade == 'c':
      counter_c = counter_c + 1
    else:
      other = other + 1

  #The len() function is used for finding the total number of items in a list.
  list_len = len(grades_list)
  prop_a = 100*(counter_a/list_len)
  prop_b = 100*(counter_b/list_len)
  prop_c = 100*(counter_c/list_len)
  prop_other = 100*(other/list_len)

  print("% of As:", prop_a)
  print("% of Bs:", prop_b)
  print("% of Cs:", prop_c)
  print("% of other grades:", prop_other)

#Trying out our function
grades_list = ['a', 'a', 'b', 'b', 'b', 'a', 'a', 'c', 'b', 'c', 'c', 'f']
grade_percentages(grades_list)
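
For comparison, here is a sketch of a function that uses return instead of print. The names `grade_percentage`, `sample_grades` and `percent_a` are hypothetical and only serve this illustration; the function handles one grade at a time so that the returned value can be stored and reused.

```python
def grade_percentage(grades_list, grade):
    '''Return the percentage of entries in grades_list equal to grade.'''
    matches = 0
    for item in grades_list:
        if item == grade:
            matches = matches + 1
    # return hands the value back to the caller instead of displaying it.
    return 100 * (matches / len(grades_list))

sample_grades = ['a', 'a', 'b', 'c']
percent_a = grade_percentage(sample_grades, 'a')
print("% of As:", percent_a)  # % of As: 50.0
```

Because the result is returned, it can be assigned to a variable (here `percent_a`) and used in later calculations, which is not possible with a function that only prints.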

## **Storing Data**

Storing data is important as it allows you to use the data later. The most basic way to store data in Python is to assign it to a variable. There are also more complex ways of storing and organizing data, such as:

- Lists: A list in Python is just a sequence of values. These values can be of any data type (strings, integers, etc.).

- Arrays: An array is also a sequence of values. Unlike a list, all the values in a particular array share a single data type. We will discuss the important differences between lists and arrays below.

- Tables: You can think of tables as a collection of lists or arrays. They are organized in columns and rows. Columns give you information about one aspect of all your entries (for example, names of all students). Rows give you information about all aspects of one entry (for example, all the information – such as name, student ID, email, etc. – of one student). In Python, tables can be created using the DataFrame object.

Let's start by talking about lists.

### **Lists**

Creating a list in Python is very easy – just put your values within square brackets ([]).

In [None]:
[3, 4, 6, 100] 

#### **Lists vs. Arrays**

Lists and arrays both store a collection of values, but they are very different objects. The following are the main differences between lists and arrays (don't worry about how to create arrays just yet).

- A particular list can contain items of different data types. For example, a list can have a string, integer, float and Boolean value. However, a particular array can only have items of one data type. 

In [None]:
#Example 1
np.array([1, 2.2, 'hello'])

array(['1', '2.2', 'hello'], dtype='<U32')

Above, when we tried to create an array with items of different data types, NumPy automatically converted all the items to a single data type that can represent every value. In Example 1, because there was a string in the array, NumPy converted all the other items to strings as well. The same rule applies to other mixes: if an array contains a float alongside integers or Boolean values, NumPy converts the remaining items to floats (the Boolean value True is encoded as 1 and False is encoded as 0).
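
A minimal sketch of the same conversion with two other mixes of types. The variable names are illustrative, and the exact string width NumPy picks (such as `<U5` or `<U32`) depends on the values, so only the kind of data type is printed for the second array.

```python
import numpy as np

# Integers, a float and a Boolean: everything becomes a float.
mixed_numbers = np.array([1, 2.5, True])
print(mixed_numbers.dtype)  # float64

# A Boolean next to a string: everything becomes a string.
mixed_text = np.array([True, 'hello'])
print(mixed_text.dtype.kind)  # U (a Unicode string type)
```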

### **List Comprehensions**

A list comprehension is a simple way to create a new list based on an existing list. It involves using conditional statements and loops.

Say you have a list of numbers and you want to create a new list by doubling all numbers greater than 5 in the existing list.

In [None]:
numbers = [4, 13, 18, 3, 25, -2, 31]
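
Without a list comprehension, this task needs a loop and a conditional statement. A sketch of that longer version (the name `new_numbers` is just an illustrative choice):

```python
numbers = [4, 13, 18, 3, 25, -2, 31]
new_numbers = []

for num in numbers:
    # Keep only the numbers greater than 5, doubled.
    if num > 5:
        new_numbers.append(num * 2)

print(new_numbers)  # [26, 36, 50, 62]
```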

Using list comprehensions, you can perform this task much more efficiently, using just one line of code.

This is the syntax for creating a list comprehension:

[*expression* for *item* in *list* if *condition*]

In [None]:
numbers = [4, 13, 18, 3, 25, -2, 31]

#Double every number greater than 5 and store the result in a new list
new_numbers = [num * 2 for num in numbers if num > 5]
new_numbers

[26, 36, 50, 62]

## **Arrays**

An array, like a list, is a collection of values (see [lab 0](https://colab.research.google.com/drive/1_u1YSTi8YX93XctMZwJfYgDST4Ofbyqn#scrollTo=59g2n6VzP4IW) for the difference between arrays and lists). One of the most important things that NumPy is used for is creating and managing arrays.

### **Creating Arrays**

We use the np.array() function to create arrays using NumPy. It takes values within square brackets as its argument.

In [None]:
num = np.array([7, 5, 9, 2, 5, 21, 13, 3])
num

array([ 7,  5,  9,  2,  5, 21, 13,  3])

Another way to create arrays is using the np.arange() function. This function can only be used to create arrays of numbers (integers and floats). You specify the start value, the stop value and the interval (step), and the function creates the array of values for you. The stop value itself is not included.

In [None]:
np.arange(1, 20, 3)

array([ 1,  4,  7, 10, 13, 16, 19])
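
The following sketch shows how the excluded stop value affects the result. The arrays `evens` and `evens_short` are made up for this illustration.

```python
import numpy as np

# Stop value 11 is excluded, so the last element is 10.
evens = np.arange(2, 11, 2)
print(evens)  # [ 2  4  6  8 10]

# With a stop value of 10, the array ends at 8 instead.
evens_short = np.arange(2, 10, 2)
print(evens_short)  # [2 4 6 8]
```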

**Question 1:** Use np.arange to create an array of all the multiples of 5 between 0 and 100 (inclusive).

### **Array Dimensions**

Arrays can have different numbers of dimensions, depending on how their elements are organized. A zero-dimensional array holds a single element and is also called a scalar. Since there is only one element, you don't have to use square brackets.

In [None]:
#0D array
np.array(67)

array(67)

One-dimensional arrays contain a single collection of multiple elements. The num array is an example of a 1D array.

In [None]:
#1D array
num

array([ 7,  5,  9,  2,  5, 21, 13,  3])

Two-dimensional arrays contain one or more collections (rows) of elements, all of the same length. These collections must be enclosed within an outer pair of square brackets.

2D arrays are especially useful for storing x and y values of data points. Arrays with 2 dimensions or more are also useful for vector algebra.

In our labs, we will mainly be using 1D arrays as examples.

In [None]:
#2D array
num_2d = np.array([[1, 4, 5], [5, 2, 0]])
num_2d

array([[1, 4, 5],
       [5, 2, 0]])

To check the number of dimensions an array has, you can use the .ndim attribute.

In [None]:
num_2d.ndim

2
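
As a sketch, the snippet below compares arrays of 0, 1 and 2 dimensions. It also prints the .shape attribute, which reports the number of elements along each dimension; the variable names are illustrative.

```python
import numpy as np

scalar = np.array(67)                    # 0D
row = np.array([7, 5, 9])                # 1D
grid = np.array([[1, 4, 5], [5, 2, 0]])  # 2D

print(scalar.ndim, scalar.shape)  # 0 ()
print(row.ndim, row.shape)        # 1 (3,)
print(grid.ndim, grid.shape)      # 2 (2, 3)
```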

**Question 2:** Say you want to know if TV shows with fewer seasons have higher overall ratings. You record the number of seasons and overall rating for 7 TV shows you randomly selected. 

  - Number of seasons: 4, 2, 7, 7, 6, 16, 10
  - Overall rating (out of 10): 6.7, 8.1, 7.8, 9.2, 9.5, 3.5, 6.3

Create **a single** array to store this information (store the array in a variable).

*Hint: Think about how many dimensions the array should have.*

### **Indexing and Slicing**

Indexing refers to extracting items in an array based on their index. You can do this using square brackets at the end of the array. Recall that Python uses 0-indexing, so the index of the first element in your array is 0, the index of the second element is 1 and so on.

In [None]:
#This will give you the 1st element in the num array.
num[0]

7

In [None]:
#This will give you the 3rd element in the num array.
num[2]

9

To index arrays of multiple dimensions, we can just use multiple square brackets.

In [None]:
#This will give you the 3rd element in the 1st collection.
num_2d[0][2]

5

We use negative indexing to access elements from the end of the array.

In [None]:
#This will give you the last element in your array.
num[-1]

3

In [None]:
#This will give you the 3rd-to-last element in your array.
num[-3]

21

Slicing refers to extracting a portion of your array using its index.

In [None]:
#This gives you the first 3 elements in your array.
num[0:3]

array([7, 5, 9])

Keep in mind that the element at the stop index is not included. In this case, the element with index 3 (so the 4th element) is not included.

In [None]:
#This gives you every 2nd element in the array starting from the 1st element.
num[::2]

array([ 7,  9,  5, 13])

In [None]:
#This gives you all the elements in the array.
num[::]

array([ 7,  5,  9,  2,  5, 21, 13,  3])

In [None]:
#This gives you every 3rd element in the array between the 2nd and 5th items.
num[1:5:3]

array([5, 5])

Lists and arrays are indexed and sliced in the same way, so you can check out [lab 0](https://colab.research.google.com/drive/1_u1YSTi8YX93XctMZwJfYgDST4Ofbyqn#scrollTo=mczvTogmbvOj) for more information on indexing and slicing. You can also experiment to discover different ways of slicing and indexing lists.

To slice arrays with multiple dimensions, you can use multiple square brackets.

In [None]:
num_2d[0][0:2:1]

array([1, 4])
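
NumPy arrays also accept a single pair of square brackets with the row and column positions separated by a comma. This is a brief sketch of that alternative, reusing the values of num_2d:

```python
import numpy as np

num_2d = np.array([[1, 4, 5], [5, 2, 0]])

# Row 0, column 2: same element as num_2d[0][2].
print(num_2d[0, 2])    # 5

# First two elements of row 0: same as num_2d[0][0:2].
print(num_2d[0, 0:2])  # [1 4]

# Every row, column 1.
print(num_2d[:, 1])    # [4 2]
```

The comma form can select along columns (as in the last line), which chained square brackets cannot do directly.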

Another way of slicing arrays is by creating an array of indices.

In [None]:
#This will give you the 2nd, 5th and 6th elements in the array.
index = np.array([1, 4, 5])
num[index]

array([ 5,  5, 21])

**Question 3:** Now, you want to use the array you created in question 2 to create two separate arrays: one that stores the number of seasons and another that stores the overall rating. Use indexing to create these two arrays.

## **Math in NumPy**

NumPy is also really useful for math. It contains various functions for performing math operations. Here are some important examples.

In [None]:
#Sum
np.sum(num)

65

In [None]:
#Average
np.mean(num)

8.125

In [None]:
#Another function to find the average
np.average(num)

8.125

The np.round function is used to round values. The first argument in the np.round function is the number you would like to round and the second argument is the number of decimal places to round to.


In [None]:
#Rounding
np.round(4.5234, 2)

4.52

You can also input an array. The function will perform an elementwise operation – that is, it will round each of the values in the array.

In [None]:
np.round(num)

array([ 7,  5,  9,  2,  5, 21, 13,  3])
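
Since num only contains integers, rounding leaves it unchanged. A sketch with made-up decimal values shows the elementwise rounding more clearly:

```python
import numpy as np

ratings = np.array([6.74, 8.12, 9.26])  # made-up values

# Each element is rounded to 1 decimal place.
rounded = np.round(ratings, 1)
print(rounded)  # [6.7 8.1 9.3]
```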

Another useful NumPy function is np.diff(), which takes the difference between consecutive elements. A given element is subtracted from the one to its right.

In [None]:
np.diff(num)

array([ -2,   4,  -7,   3,  16,  -8, -10])
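
To see what np.diff() computes, the sketch below reproduces it with slicing: every element except the first, minus every element except the last. The name `manual_diff` is illustrative.

```python
import numpy as np

num = np.array([7, 5, 9, 2, 5, 21, 13, 3])

# Elements 2..last minus elements 1..second-to-last.
manual_diff = num[1:] - num[:-1]
print(manual_diff)
print(np.array_equal(manual_diff, np.diff(num)))  # True
```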

**Question 4:** Find the average overall rating for the TV shows in the sample (use the array in question 3).

# **Introduction to Pandas**

<img src='https://drive.google.com/uc?id=1Elaps1FSXJGcuLEdl9xiBTV9WRAWkDoy'>

Pandas is another Python library used for data science. It is especially useful for creating and managing dataframes (which are basically tables).

To use the functions in the Pandas library, we first need to import it. We will be importing it as pd (similar to how we import NumPy as np).

In [None]:
import pandas as pd

## **Importing Datasets**

One of the most important Pandas functions is pd.read_csv(). It is used for loading .csv files (.csv is a file extension, similar to .docx), one of the most common file formats for datasets. The function automatically converts your .csv file to a dataframe object.

Below, we are importing a dataset on different college majors and salaries. We will be using this dataset throughout this lab.

Importing a dataset into Google Colab requires a few more steps. One way of doing this is by connecting your Google account to the notebook using this code:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


When you run the above cell, you will be asked to authenticate your account. Once you've chosen your Google account and copied the verification code, go to the Files section (it is the last icon in the sidebar at the left) then open your Drive. Upload your dataset to the Drive of the account you just connected, click on the 3 vertical dots and select 'Copy path.'

Assign the path to a variable (as we have done below, in the first line). Then, use the pd.read_csv() function to load your dataset.

In [None]:
path = "/content/drive/MyDrive/Copy of salary_information.csv"
salary_df = pd.read_csv(path)
salary_df.head(5)

| | undergrad_major | starting_salary | midcareer_salary | percent_change_starting | midcareer_tenth_percentile | midcareer_twentyfifth_percentile | midcareer_seventyfifth_percentile | midcareer_nintieth_percentile | Unnamed: 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Accounting | \$46,000.00 | \$77,100.00 | 67.6 | \$42,200.00 | \$56,100.00 | \$108,000.00 | \$152,000.00 | NaN |
| 1 | Aerospace Engineering | \$57,700.00 | \$101,000.00 | 75.0 | \$64,300.00 | \$82,100.00 | \$127,000.00 | \$161,000.00 | NaN |
| 2 | Agriculture | \$42,600.00 | \$71,900.00 | 68.8 | \$36,300.00 | \$52,100.00 | \$96,300.00 | \$150,000.00 | NaN |
| 3 | Anthropology | \$36,800.00 | \$61,500.00 | 67.1 | \$33,800.00 | \$45,500.00 | \$89,300.00 | \$138,000.00 | NaN |
| 4 | Architecture | \$41,600.00 | \$76,800.00 | 84.6 | \$50,600.00 | \$62,200.00 | \$97,000.00 | \$136,000.00 | NaN |


The .head() method is used for displaying the first few rows of your dataframe. The argument is the number of rows you'd like to see (in this case, 5). You can use the .tail() method to display the last rows of your dataframe.

## **Series**

Series, similar to arrays and lists, are used for storing a collection of elements. Series are different from arrays and lists in one important way: you can create your own index for the elements in a series. However, you cannot change the indices of the elements in an array or list: you have to access them using their integer index.

Series are similar to lists as they are heterogeneous, which means that a single series or list can hold data of different types (integers, floats, etc.). Arrays, on the other hand, are homogeneous, so a particular array can only hold data of a single type.

### **Creating Series**

The pd.Series() function is used for creating series (note that Series has to be capitalized as Python is case-sensitive). There are 2 ways of creating series:

- Convert arrays to series
- Create a new series

The first argument in the pd.Series() function is the collection of elements within square brackets. If you want to convert an array to a series, you can just enter the name of the array as the first argument. The second argument is optional and allows you to change the indices of the elements in your series. If you don't specify this argument, the default indices will be used, as in the example below:

In [None]:
#Converting arrays to series
num_series = pd.Series(num)
num_series

0     7
1     5
2     9
3     2
4     5
5    21
6    13
7     3
dtype: int64

You can specify the indices within square brackets, as shown below. Note that the number of indices you specify should be the same as the number of elements in your series.

In [None]:
#Create a new series
series = pd.Series([90, 50, 73, 81], index = [1, 2, 3, 4])
series

1    90
2    50
3    73
4    81
dtype: int64

You can specify anything as the indices of your series, even strings.

In [None]:
series_str = pd.Series([90, 50, 73, 81], index = ["one", "two", "three", "four"])
series_str

one      90
two      50
three    73
four     81
dtype: int64
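
Custom indices can then be used to look up values by label. A minimal sketch, rebuilding series_str so that the snippet runs on its own (`subset` is an illustrative name):

```python
import pandas as pd

series_str = pd.Series([90, 50, 73, 81], index=["one", "two", "three", "four"])

# Look up a single value by its label.
print(series_str["two"])  # 50

# Look up several labels at once; the result is a smaller series.
subset = series_str[["one", "four"]]
print(subset.tolist())  # [90, 81]
```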

## **Dictionaries**

Dictionaries in Python are collections of elements in which each element is given a unique key to identify it. You can use these keys to access your items. It's up to you to decide the keys of your dictionary.

Here's an example of a dictionary. Say you keep track of the classes you're taking on 3 days of the week. You can create a dictionary to store this information.

In [None]:
class_tracker = {
    'monday': ["data 8", "econ 100", "stat 135"],
    'tuesday': ["econ 100", "math 53"],
    'wednesday': ["philos 12a", "data 8", "stat 135"]
}
class_tracker

{'monday': ['data 8', 'econ 100', 'stat 135'],
 'tuesday': ['econ 100', 'math 53'],
 'wednesday': ['philos 12a', 'data 8', 'stat 135']}

To add items to your dictionary, you can just do dict[key] = value. Suppose you want to add your schedule of classes for Thursday:

In [None]:
class_tracker["thursday"] = ["math 53", "philos 12a"]
class_tracker

{'monday': ['data 8', 'econ 100', 'stat 135'],
 'thursday': ['math 53', 'philos 12a'],
 'tuesday': ['econ 100', 'math 53'],
 'wednesday': ['philos 12a', 'data 8', 'stat 135']}

You can use keys to access elements from your dictionaries. For example, to print out the classes you have on Tuesday, you can just do this:

In [None]:
class_tracker["tuesday"]

['econ 100', 'math 53']
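
Accessing a key that does not exist raises a KeyError. The sketch below, which uses a shortened copy of class_tracker, shows a few standard ways to inspect a dictionary before reading from it (`friday_classes` is an illustrative name):

```python
class_tracker = {
    'monday': ["data 8", "econ 100", "stat 135"],
    'tuesday': ["econ 100", "math 53"],
}

# List the keys currently in the dictionary.
print(list(class_tracker.keys()))  # ['monday', 'tuesday']

# Check whether a key exists before using it.
print('friday' in class_tracker)   # False

# .get() returns a default instead of raising a KeyError.
friday_classes = class_tracker.get('friday', [])
print(friday_classes)              # []
```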

**Question 5:** Issac wants to track information about the classes he's taking next semester. He's taking 4 classes (in brackets is the class code): 

  - Introduction to Microeconomics (ECON 100)
  - Computer Science (COMPSCI 110)
  - 3D Printing (ART 10)
  - Probability and Statistics (STAT 130)

He's enrolled in the economics and statistics classes, but is waitlisted in the other classes. He's happy that the computer science and 3D printing classes have a maximum class size of 20 but is worried that the other classes are bigger, with a maximum size of 50.

- He wants to create a dictionary to store the class name, class code, whether he's enrolled or waitlisted and maximum class size of each class. Create this dictionary.
-  Issac wants to know the codes of the classes he's taking. Use the dictionary you just created to retrieve the class codes.

## **Dataframes**

A dataframe is a Pandas object that stores data as a table. You can think of the rows and columns in a dataframe as a collection of series.

### **Creating DataFrames**

You can either import a dataset using the pd.read_csv() function (which automatically converts your .csv file to a dataframe) or you can create a dataframe from scratch. The pd.DataFrame() function is used to create a dataframe. There are different ways of creating a dataframe using this function:

- Creating a dataframe from lists
- Creating a dataframe from arrays
- Creating a dataframe from series
- Creating a dataframe from dictionaries

Let's look at each option.

The first way to create a dataframe is by creating a nested list (that is, lists within a list). Each list in the nested list represents a row. Below, student_info is an example of a nested list.

Then, pass this list as the first argument in the pd.DataFrame function. The second argument, which is optional, is the names of your columns. If you don't specify anything, Pandas will just set integers starting from 0 as the column names.

In [None]:
student_info = [[1011, "Andy", "9th", True], [1012, "Isabel", "10th", False]]
student_df = pd.DataFrame(student_info, columns = ["Student ID", "First Name", "Grade", "Enrolled?"])
student_df

| | Student ID | First Name | Grade | Enrolled? |
|---|---|---|---|---|
| 0 | 1011 | Andy | 9th | True |
| 1 | 1012 | Isabel | 10th | False |


Every row in a dataframe has an index, as you can see above. You can set your own custom index. 

In [None]:
student_df = pd.DataFrame(student_info, columns = ["Student ID", "First Name", "Grade", "Enrolled?"], index = ["Student 1", "Student 2"])
student_df

| | Student ID | First Name | Grade | Enrolled? |
|---|---|---|---|---|
| Student 1 | 1011 | Andy | 9th | True |
| Student 2 | 1012 | Isabel | 10th | False |


You can also create a dataframe using arrays. Since a particular array can only hold data of a single type, you can use it as a column for your dataframe.

In [None]:
#Arrays of columns – each array is a column in your dataframe
student_id = np.array([1011, 1012])
names = np.array(["Andy", "Isabel"])
grade = np.array(["9th", "10th"])
enrolled = np.array([True, False])

#Creating the dataframe
student_df_2 = pd.DataFrame({"Student ID": student_id, "First Name": names, "Grade": grade, "Enrolled?": enrolled}, index = ["Student 1", "Student 2"])
student_df_2

| | Student ID | First Name | Grade | Enrolled? |
|---|---|---|---|---|
| Student 1 | 1011 | Andy | 9th | True |
| Student 2 | 1012 | Isabel | 10th | False |


You can also create series and pass them as columns in your dataframe. This process is similar to using arrays to create a dataframe. Note: don't specify the index in the pd.DataFrame function here. Pandas matches the index you pass against each series' own index, so labels that don't appear in the series produce NaN instead of your values. If you want to change the index, change it in the series.

In [None]:
#Series of columns – each series is a column in your dataframe
student_id_series = pd.Series([1011, 1012])
names_series = pd.Series(["Andy", "Lana"])
grade_series = pd.Series(["9th", "10th"])
enrolled_series = pd.Series([True, False])

#Creating the dataframe
student_df_3 = pd.DataFrame({"Student ID": student_id_series, "First Name": names_series, "Grade": grade_series, "Enrolled?": enrolled_series})
student_df_3

| | Student ID | First Name | Grade | Enrolled? |
|---|---|---|---|---|
| 0 | 1011 | Andy | 9th | True |
| 1 | 1012 | Lana | 10th | False |
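
The sketch below illustrates why the index should be set on the series rather than in pd.DataFrame. The names `ids`, `misaligned`, `labeled_ids` and `aligned` are hypothetical and only used here.

```python
import pandas as pd

ids = pd.Series([1011, 1012])  # default index: 0, 1

# The requested labels do not exist in the series, so nothing lines up.
misaligned = pd.DataFrame({"Student ID": ids}, index=["Student 1", "Student 2"])
print(misaligned["Student ID"].isna().all())  # True

# Setting the labels on the series itself keeps the values.
labeled_ids = pd.Series([1011, 1012], index=["Student 1", "Student 2"])
aligned = pd.DataFrame({"Student ID": labeled_ids})
print(aligned.loc["Student 1", "Student ID"])  # 1011
```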


The final way to create a dataframe is using a dictionary. You don't have to specify column names when you use this method as your dictionary keys will be the column names.

In [None]:
student_dict = {
    "Student ID": [1011, 1012],
    "First Name": ["Andy", "Lana"],
    "Grade": ["9th", "10th"],
    "Enrolled?": [True, False]
}

student_df_4 = pd.DataFrame(student_dict, index = ["Student 1", "Student 2"])
student_df_4

| | Student ID | First Name | Grade | Enrolled? |
|---|---|---|---|---|
| Student 1 | 1011 | Andy | 9th | True |
| Student 2 | 1012 | Lana | 10th | False |


**Question 6:** Issac decides that it's better to store this information as a dataframe. Convert the dictionary to a dataframe.

### **Data Cleaning**

Generally, you cannot start using a dataset right after you load it into Python, as there may be a lot of things wrong with it. Therefore, we first have to clean the dataset before we start using it.

In our salaries dataset, we have to fix the following things:

1. Converting the values in each column from strings to floats 
2. Removing NaN values

**Converting Strings to Floats**

If you check the data type of a value in one of the numeric columns, you will see that it is actually a string rather than a number.

In [None]:
type(salary_df["starting_salary"][0])

str

Note: `df["column_name"]` allows you to select a column in your dataframe. You can then use indexing to select a particular value in this column.

This means that we cannot perform mathematical operations on our data (such as taking the mean). Hence, we have to convert the values to numbers.

To do this, we first want to remove any symbols (such as commas and dollar signs) using the .replace() method, then convert the string to a float using the float() function. Since we have to do this repeatedly on multiple values, we can define a function to make things easier.

In [None]:
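#String conversion function
def string_convert(value):
  '''Converts a string such as "$46,000.00" to the float 46000.0.'''
  #Values that are already numbers have no .replace() method, so convert them directly.
  #This also lets you re-run this cell without getting an AttributeError.
  if not isinstance(value, str):
    return float(value)
  remove_comma = value.replace(",", "")
  remove_dollar = remove_comma.replace("$", "")
  return float(remove_dollar)

#Checking if our function works
salary_df["starting_salary"] = salary_df["starting_salary"].apply(string_convert)
type(salary_df["starting_salary"][0])

numpy.float64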

We have successfully converted the strings in the starting_salary column to numbers! Let's do the same for all the other columns.

In [None]:
#Finding all the columns in the dataset
salary_df.columns

Index(['undergrad_major', 'starting_salary', 'midcareer_salary',
       'percent_change_starting', 'midcareer_tenth_percentile',
       'midcareer_twentyfifth_percentile', 'midcareer_seventyfifth_percentile',
       'midcareer_nintieth_percentile', 'Unnamed: 8'],
      dtype='object')

In [None]:
#percent_change_starting is already stored as numbers, so it is left out
salary_columns = ["midcareer_salary", "midcareer_tenth_percentile",
                  "midcareer_twentyfifth_percentile", "midcareer_seventyfifth_percentile",
                  "midcareer_nintieth_percentile"]

for column in salary_columns:
  salary_df[column] = salary_df[column].apply(string_convert)

Note: we have to specify `salary_df["column_name"] = salary_df["column_name"].apply(string_convert)` to make sure that the changes we made are saved and reflected in our dataframe. Otherwise, the conversion from string to float will not be shown in the dataframe.
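
As an alternative sketch (not the approach used in this lab), Pandas string methods can clean a whole column at once. The `sample` series below contains made-up values in the same format as the salary columns.

```python
import pandas as pd

# Made-up sample values in the same format as the salary columns.
sample = pd.Series(["$46,000.00", "$57,700.00"])

# regex=False treats "$" as a literal character, not a pattern.
cleaned = (
    sample.str.replace(",", "", regex=False)
          .str.replace("$", "", regex=False)
          .astype(float)
)
print(cleaned.tolist())  # [46000.0, 57700.0]
```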

We're done with the first part of our data cleaning!

**Removing NaN Values**

You can drop all the NaN values in your dataset using the .dropna() function. However, we will be removing the last column, which only contains NaN values, rather than just dropping NaN values. To remove a column in a dataframe, we use the .drop() function.

In [None]:
salary_df.head(4)

| | undergrad_major | starting_salary | midcareer_salary | percent_change_starting | midcareer_tenth_percentile | midcareer_twentyfifth_percentile | midcareer_seventyfifth_percentile | midcareer_nintieth_percentile | Unnamed: 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Accounting | 46000.0 | 77100.0 | 67.6 | 42200.0 | 56100.0 | 108000.0 | 152000.0 | NaN |
| 1 | Aerospace Engineering | 57700.0 | 101000.0 | 75.0 | 64300.0 | 82100.0 | 127000.0 | 161000.0 | NaN |
| 2 | Agriculture | 42600.0 | 71900.0 | 68.8 | 36300.0 | 52100.0 | 96300.0 | 150000.0 | NaN |
| 3 | Anthropology | 36800.0 | 61500.0 | 67.1 | 33800.0 | 45500.0 | 89300.0 | 138000.0 | NaN |


In [None]:
salary_df = salary_df.drop(columns = ["Unnamed: 8"])
salary_df.columns

Index(['undergrad_major', 'starting_salary', 'midcareer_salary',
       'percent_change_starting', 'midcareer_tenth_percentile',
       'midcareer_twentyfifth_percentile', 'midcareer_seventyfifth_percentile',
       'midcareer_nintieth_percentile'],
      dtype='object')
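
For reference, here is a sketch of how .dropna() behaves on a small made-up dataframe (`toy` is not part of the salary dataset). The `axis` and `how` parameters control whether rows or columns are dropped and how many NaN values are needed to drop them.

```python
import numpy as np
import pandas as pd

# Made-up frame with one empty column and one missing value.
toy = pd.DataFrame({
    "major": ["Accounting", "Agriculture"],
    "salary": [46000.0, np.nan],
    "empty": [np.nan, np.nan],
})

# Drop columns in which every value is NaN.
no_empty_cols = toy.dropna(axis=1, how="all")
print(list(no_empty_cols.columns))  # ['major', 'salary']

# Drop rows that still contain any NaN.
complete_rows = no_empty_cols.dropna()
print(len(complete_rows))  # 1
```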

### **Dataframe Functions**

There are a ton of important dataframe functions but in this section, we'll just talk about a few important ones. 

**Summary Functions**

First, we'll be talking about the functions we can use for extracting information from our dataset. 

- df.info() - This function gives you a summary of your dataset, including the column names, data types, etc. 
- df.shape - This gives you the dimensions of your dataframe in this format: (rows, columns).
- df.columns - This gives you the names of the columns in your dataset in the order in which they appear.

Let's apply these functions on the salary dataset.

In [None]:
salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 8 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   undergrad_major                    50 non-null     object 
 1   starting_salary                    50 non-null     float64
 2   midcareer_salary                   50 non-null     float64
 3   percent_change_starting            50 non-null     float64
 4   midcareer_tenth_percentile         50 non-null     float64
 5   midcareer_twentyfifth_percentile   50 non-null     float64
 6   midcareer_seventyfifth_percentile  50 non-null     float64
 7   midcareer_nintieth_percentile      50 non-null     float64
dtypes: float64(7), object(1)
memory usage: 3.2+ KB


In [None]:
salary_df.shape

(50, 8)

In [None]:
salary_df.columns

Index(['undergrad_major', 'starting_salary', 'midcareer_salary',
       'percent_change_starting', 'midcareer_tenth_percentile',
       'midcareer_twentyfifth_percentile', 'midcareer_seventyfifth_percentile',
       'midcareer_nintieth_percentile'],
      dtype='object')

**Other Functions**

Here, we'll be talking about other important dataframe functions that you'll likely use for your project.

- df.describe() - This gives you a statistical summary of the numerical columns (that is, columns whose data type is integer or float) in your dataset. For each of the numerical columns, you will get mean, standard deviation, etc. 

- df.groupby() - This function allows you to group your data into different categories and perform operations on those categories.

- df.sort_values() - This is used for sorting the rows of your dataframe by the values in one or more columns, in ascending or descending order.

In [None]:
salary_df.describe()

Unnamed: 0,starting_salary,midcareer_salary,percent_change_starting,midcareer_tenth_percentile,midcareer_twentyfifth_percentile,midcareer_seventyfifth_percentile,midcareer_nintieth_percentile
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,44310.0,74786.0,69.274,43408.0,55988.0,102138.0,142766.0
std,9360.866217,16088.40386,17.909908,12000.779567,13936.951911,20636.789914,27851.249267
min,34000.0,52000.0,23.4,26700.0,36500.0,70500.0,96400.0
25%,37050.0,60825.0,59.125,34825.0,44975.0,83275.0,124250.0
50%,40850.0,72000.0,67.8,39400.0,52450.0,99400.0,145500.0
75%,49875.0,88750.0,82.425,49850.0,63700.0,118750.0,161750.0
max,74300.0,107000.0,103.5,71900.0,87300.0,145000.0,210000.0
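
The result of describe() is itself a dataframe, so you can pick out a single statistic from it. Here is a minimal sketch on a made-up column (`toy_df` and `score` are hypothetical names used only for this example):

```python
import pandas as pd

# Made-up toy data for illustration only.
toy_df = pd.DataFrame({"score": [1.0, 2.0, 3.0, 4.0]})

summary = toy_df.describe()

# Rows of the summary are labelled "count", "mean", "std", "min", "25%", "50%", "75%", "max".
mean_score = summary.loc["mean", "score"]
median_score = summary.loc["50%", "score"]

# The same statistics are also available directly on a column.
print(mean_score, toy_df["score"].mean())
print(median_score, toy_df["score"].median())
```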


Grouping the `salary_df` dataframe by major wouldn't be useful, as every major appears only once, so each group would contain a single row. Instead, let's load a new dataset.

In [None]:
import seaborn as sns
df = sns.load_dataset('iris')
df.head(5)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [None]:
#This gives the mean of each numerical column for each species.
df.groupby("species").mean()

species,sepal_length,sepal_width,petal_length,petal_width
setosa,5.006,3.428,1.462,0.246
versicolor,5.936,2.77,4.26,1.326
virginica,6.588,2.974,5.552,2.026
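
You can also group by one column and summarize just one other column, optionally with several statistics at once. The sketch below uses a made-up dataframe (`pets`, `animal` and `weight` are hypothetical names), so treat it as an illustration rather than part of the lab's datasets.

```python
import pandas as pd

# Made-up toy data for illustration only.
pets = pd.DataFrame({
    "animal": ["cat", "dog", "cat", "dog"],
    "weight": [4.0, 20.0, 6.0, 30.0],
})

# Mean weight for each animal (one value per group).
mean_weight = pets.groupby("animal")["weight"].mean()

# Several statistics per group at once, using agg().
stats = pets.groupby("animal")["weight"].agg(["min", "max", "mean"])

print(mean_weight)
print(stats)
```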


In [None]:
#This sorts the entire dataset in descending order of midcareer salary and shows the first 10 rows.
salary_df.sort_values(by = "midcareer_salary", ascending = False).head(10)

Unnamed: 0,undergrad_major,starting_salary,midcareer_salary,percent_change_starting,midcareer_tenth_percentile,midcareer_twentyfifth_percentile,midcareer_seventyfifth_percentile,midcareer_nintieth_percentile
8,Chemical Engineering,63200.0,107000.0,69.3,71900.0,87300.0,143000.0,194000.0
12,Computer Engineering,61400.0,105000.0,71.0,66100.0,84100.0,135000.0,162000.0
19,Electrical Engineering,60900.0,103000.0,69.1,69300.0,83800.0,130000.0,168000.0
1,Aerospace Engineering,57700.0,101000.0,75.0,64300.0,82100.0,127000.0,161000.0
17,Economics,50100.0,98600.0,96.8,50600.0,70600.0,145000.0,210000.0
44,Physics,50300.0,97300.0,93.4,56000.0,74200.0,132000.0,178000.0
13,Computer Science,55900.0,95500.0,70.8,56000.0,74900.0,122000.0,154000.0
30,Industrial Engineering,57700.0,94700.0,64.1,57100.0,72300.0,132000.0,173000.0
38,Mechanical Engineering,57900.0,93600.0,61.7,63700.0,76200.0,120000.0,163000.0
37,Math,45400.0,92400.0,103.5,45200.0,64200.0,128000.0,183000.0
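
Two details are worth knowing about sort_values(): it sorts in ascending order by default, and it returns a new dataframe rather than changing the original. You can also sort by more than one column. Here is a minimal sketch on made-up data (`toy_df`, `group` and `value` are hypothetical names):

```python
import pandas as pd

# Made-up toy data for illustration only.
toy_df = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "value": [3, 1, 2, 4],
})

# Ascending order is the default.
ascending = toy_df.sort_values(by="value")

# Sort by group (A to Z) and, within each group, by value from largest to smallest.
by_two = toy_df.sort_values(by=["group", "value"], ascending=[True, False])

print(ascending)
print(by_two)
print(toy_df)  # the original dataframe keeps its order
```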


### **Indexing, Slicing and Subsetting Dataframes**

You can access values in your dataset using indexing, slicing and subsetting.

This is how you can select one column in your dataset.

In [None]:
salary_df["undergrad_major"]

0                               Accounting 
1                    Aerospace Engineering 
2                              Agriculture 
3                             Anthropology 
4                             Architecture 
5                              Art History 
6                                  Biology 
7                      Business Management 
8                     Chemical Engineering 
9                                Chemistry 
10                       Civil Engineering 
11                          Communications 
12                    Computer Engineering 
13                        Computer Science 
14                            Construction 
15                        Criminal Justice 
16                                   Drama 
17                               Economics 
18                               Education 
19                  Electrical Engineering 
20                                 English 
21                                    Film 
22                              

This is how you can select multiple columns in your dataset.

In [None]:
salary_df[["undergrad_major", "starting_salary"]]

Unnamed: 0,undergrad_major,starting_salary
0,Accounting,46000.0
1,Aerospace Engineering,57700.0
2,Agriculture,42600.0
3,Anthropology,36800.0
4,Architecture,41600.0
5,Art History,35800.0
6,Biology,38800.0
7,Business Management,43000.0
8,Chemical Engineering,63200.0
9,Chemistry,42600.0


You can use indexing to select rows. This is the format:

```
df[start_value: end_value]
```

Note that row indices start from 0 (similar to the indices of arrays and lists). The row at the end value is not included.

In [None]:
#This will give you the first row in your dataframe.
salary_df[0:1]

Unnamed: 0,undergrad_major,starting_salary,midcareer_salary,percent_change_starting,midcareer_tenth_percentile,midcareer_twentyfifth_percentile,midcareer_seventyfifth_percentile,midcareer_nintieth_percentile
0,Accounting,46000.0,77100.0,67.6,42200.0,56100.0,108000.0,152000.0


In [None]:
#This will give you the first 4 rows in your dataframe.
salary_df[0:4]

Unnamed: 0,undergrad_major,starting_salary,midcareer_salary,percent_change_starting,midcareer_tenth_percentile,midcareer_twentyfifth_percentile,midcareer_seventyfifth_percentile,midcareer_nintieth_percentile
0,Accounting,46000.0,77100.0,67.6,42200.0,56100.0,108000.0,152000.0
1,Aerospace Engineering,57700.0,101000.0,75.0,64300.0,82100.0,127000.0,161000.0
2,Agriculture,42600.0,71900.0,68.8,36300.0,52100.0,96300.0,150000.0
3,Anthropology,36800.0,61500.0,67.1,33800.0,45500.0,89300.0,138000.0


You can also include a step value.

In [None]:
#This will give you every 2nd row among the first 10 rows (rows 0, 2, 4, 6 and 8).
salary_df[0:10:2]

Unnamed: 0,undergrad_major,starting_salary,midcareer_salary,percent_change_starting,midcareer_tenth_percentile,midcareer_twentyfifth_percentile,midcareer_seventyfifth_percentile,midcareer_nintieth_percentile
0,Accounting,46000.0,77100.0,67.6,42200.0,56100.0,108000.0,152000.0
2,Agriculture,42600.0,71900.0,68.8,36300.0,52100.0,96300.0,150000.0
4,Architecture,41600.0,76800.0,84.6,50600.0,62200.0,97000.0,136000.0
6,Biology,38800.0,64800.0,67.0,36900.0,47400.0,94500.0,135000.0
8,Chemical Engineering,63200.0,107000.0,69.3,71900.0,87300.0,143000.0,194000.0
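
Row slicing follows the same start:end:step rule as slicing a regular Python list. The sketch below checks this on a made-up dataframe (`letters` and `toy_df` are hypothetical names used only here):

```python
import pandas as pd

# Made-up toy data for illustration only: one row per letter.
letters = list("abcdefghij")
toy_df = pd.DataFrame({"letter": letters})

# Slice the list and the dataframe with the same start:end:step values.
list_slice = letters[0:10:2]
df_slice = toy_df[0:10:2]

print(list_slice)
print(list(df_slice["letter"]))
```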


Negative indexing allows you to select rows from the end of your dataframe.

In [None]:
#This will give you the second-to-last row in your dataframe (use salary_df[-1:] to get the last row).
salary_df[-2:-1]

Unnamed: 0,undergrad_major,starting_salary,midcareer_salary,percent_change_starting,midcareer_tenth_percentile,midcareer_twentyfifth_percentile,midcareer_seventyfifth_percentile,midcareer_nintieth_percentile
48,Sociology,36500.0,58200.0,59.5,30700.0,40400.0,81200.0,118000.0
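
Because the end value of a slice is not included, the slice `[-2:-1]` stops just before the last row, while `[-1:]` selects the last row itself. Here is a minimal sketch on made-up data (`toy_df` and `value` are hypothetical names):

```python
import pandas as pd

# Made-up toy data for illustration only.
toy_df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

last_row = toy_df[-1:]           # the last row
second_to_last = toy_df[-2:-1]   # the row just before the last one
last_three = toy_df[-3:]         # the last 3 rows
same_as_tail = toy_df.tail(3)    # tail(3) selects the same 3 rows

print(last_row)
print(second_to_last)
print(last_three)
```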


Two tools that are useful for indexing and slicing are .loc and .iloc. They are used with square brackets rather than parentheses: the first entry is the row(s) you would like to select and the second entry is the column(s) you would like to select. .loc selects based on labels (row and column names), and a label slice such as 0:4 includes the end label. .iloc selects based on integer positions, and the end position is not included (as in regular slicing).

In [None]:
#This allows you to select the first column.
salary_df.loc[:, "undergrad_major"]

0                               Accounting 
1                    Aerospace Engineering 
2                              Agriculture 
3                             Anthropology 
4                             Architecture 
5                              Art History 
6                                  Biology 
7                      Business Management 
8                     Chemical Engineering 
9                                Chemistry 
10                       Civil Engineering 
11                          Communications 
12                    Computer Engineering 
13                        Computer Science 
14                            Construction 
15                        Criminal Justice 
16                                   Drama 
17                               Economics 
18                               Education 
19                  Electrical Engineering 
20                                 English 
21                                    Film 
22                              

In [None]:
#This will give you the first 5 rows for the 2 specified columns (with .loc, the end label 4 is included).
salary_df.loc[0:4, ["undergrad_major", "starting_salary"]]

Unnamed: 0,undergrad_major,starting_salary
0,Accounting,46000.0
1,Aerospace Engineering,57700.0
2,Agriculture,42600.0
3,Anthropology,36800.0
4,Architecture,41600.0


In [None]:
#This will give you the first 4 rows of the first 3 columns.
salary_df.iloc[0:4, 0:3]

Unnamed: 0,undergrad_major,starting_salary,midcareer_salary
0,Accounting,46000.0,77100.0
1,Aerospace Engineering,57700.0,101000.0
2,Agriculture,42600.0,71900.0
3,Anthropology,36800.0,61500.0
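
The difference between labels and positions is easier to see when the row labels are not 0, 1, 2, and so on. The sketch below uses a made-up dataframe whose rows are labelled with letters (`toy_df`, `value` and the labels are hypothetical):

```python
import pandas as pd

# Made-up toy data for illustration only, with letters as row labels.
toy_df = pd.DataFrame(
    {"value": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

# .loc works with labels, and the end label "c" is included.
by_label = toy_df.loc["a":"c", "value"]

# .iloc works with integer positions, and the end position 2 is not included.
by_position = toy_df.iloc[0:2, 0]

print(by_label)
print(by_position)
```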


You can also subset dataframes using comparison operators like > and ==. This allows you to select the rows whose values in a column match a particular condition.

In [None]:
#This will show the first 5 rows in your dataframe for which the midcareer salary is greater than $30,000.
salary_df[salary_df["midcareer_salary"] > 30000].head(5)

Unnamed: 0,undergrad_major,starting_salary,midcareer_salary,percent_change_starting,midcareer_tenth_percentile,midcareer_twentyfifth_percentile,midcareer_seventyfifth_percentile,midcareer_nintieth_percentile
0,Accounting,46000.0,77100.0,67.6,42200.0,56100.0,108000.0,152000.0
1,Aerospace Engineering,57700.0,101000.0,75.0,64300.0,82100.0,127000.0,161000.0
2,Agriculture,42600.0,71900.0,68.8,36300.0,52100.0,96300.0,150000.0
3,Anthropology,36800.0,61500.0,67.1,33800.0,45500.0,89300.0,138000.0
4,Architecture,41600.0,76800.0,84.6,50600.0,62200.0,97000.0,136000.0


Another way of doing this:

In [None]:
#This will show values in your dataframe for which the starting salary is greater than $50,000.
salary_df[salary_df.starting_salary > 50000].head(5)

Unnamed: 0,undergrad_major,starting_salary,midcareer_salary,percent_change_starting,midcareer_tenth_percentile,midcareer_twentyfifth_percentile,midcareer_seventyfifth_percentile,midcareer_nintieth_percentile
1,Aerospace Engineering,57700.0,101000.0,75.0,64300.0,82100.0,127000.0,161000.0
8,Chemical Engineering,63200.0,107000.0,69.3,71900.0,87300.0,143000.0,194000.0
10,Civil Engineering,53900.0,90500.0,67.9,63400.0,75100.0,115000.0,148000.0
12,Computer Engineering,61400.0,105000.0,71.0,66100.0,84100.0,135000.0,162000.0
13,Computer Science,55900.0,95500.0,70.8,56000.0,74900.0,122000.0,154000.0
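
Both forms work the same way: the comparison first produces a column of True/False values (one per row), and the dataframe then keeps only the rows marked True. Here is a minimal sketch on made-up data (`toy_df`, `name` and `salary` are hypothetical names):

```python
import pandas as pd

# Made-up toy data for illustration only.
toy_df = pd.DataFrame({
    "name": ["x", "y", "z"],
    "salary": [30000, 60000, 45000],
})

# Step 1: the comparison gives one True/False value per row.
mask = toy_df["salary"] > 40000

# Step 2: passing that True/False column in brackets keeps only the True rows.
filtered = toy_df[mask]

print(mask)
print(filtered)
print(mask.sum())  # number of rows that satisfy the condition
```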


You can also use logical operators to filter based on multiple conditions. Wrap each condition in parentheses and combine them with & (and) or | (or).

In [None]:
#This will show values in your dataframe for which the starting salary is greater than $50,000 and midcareer salary is greater than $100,000.
salary_df[(salary_df.starting_salary > 50000) & (salary_df.midcareer_salary > 100000)].head(5)

Unnamed: 0,undergrad_major,starting_salary,midcareer_salary,percent_change_starting,midcareer_tenth_percentile,midcareer_twentyfifth_percentile,midcareer_seventyfifth_percentile,midcareer_nintieth_percentile
1,Aerospace Engineering,57700.0,101000.0,75.0,64300.0,82100.0,127000.0,161000.0
8,Chemical Engineering,63200.0,107000.0,69.3,71900.0,87300.0,143000.0,194000.0
12,Computer Engineering,61400.0,105000.0,71.0,66100.0,84100.0,135000.0,162000.0
19,Electrical Engineering,60900.0,103000.0,69.1,69300.0,83800.0,130000.0,168000.0
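
Besides & (and), pandas also supports | (or) and ~ (not) for combining conditions. The sketch below shows all three on a made-up dataframe (`toy_df`, `starting` and `midcareer` are hypothetical names, and the numbers are invented):

```python
import pandas as pd

# Made-up toy data for illustration only.
toy_df = pd.DataFrame({
    "starting": [40000, 55000, 60000, 45000],
    "midcareer": [70000, 90000, 110000, 105000],
})

high_start = toy_df["starting"] > 50000
high_mid = toy_df["midcareer"] > 100000

both = toy_df[high_start & high_mid]        # rows where both conditions hold
either = toy_df[high_start | high_mid]      # rows where at least one condition holds
neither = toy_df[~(high_start | high_mid)]  # rows where neither condition holds

print(both)
print(either)
print(neither)
```

Storing each condition in its own variable, as done here, is optional, but it can make longer filters easier to read.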


# **Resources**

1. [NumPy Official Website](https://numpy.org/)
2. [List of NumPy Math Functions](https://numpy.org/doc/stable/reference/routines.math.html)
3. [Official Pandas Website](https://pandas.pydata.org/)

# **Project Tasks**

We will mainly be working with 2 datasets for the final project. The first one is a dataset with information on tweets about the COVID-19 vaccination across the world. The second one is a dataset tracking the progress different countries made towards vaccinating people.

1. Open these links and download the 2 datasets.

  - [COVID-19 All Vaccines Tweets](https://www.kaggle.com/gpreda/all-covid19-vaccines-tweets)
  - [COVID-19 World Vaccination Progress](https://www.kaggle.com/gpreda/covid-world-vaccination-progress)

2. Load these datasets into Python as dataframes (either using the method we discussed in this lab or some other method you find easier).

3. Explore the dataset using dataframe summary functions: .info(), .shape, .columns. You could also look at the data types of the values of different columns to see if they make sense.

4. Although these datasets are already cleaned, can you think of more ways of cleaning the dataset?

5. Your final project will be focused on one idea/topic you want to explore. It's useful to start thinking about what you would like to work on! 

  Look at the datasets and come up with 2-3 ideas you would like to work on for your final project. For example, one potentially interesting project idea is to see if there's any correlation between sentiments and vaccination rate. This is just for brainstorming – you don't have to finalize any ideas yet. 

  You can also do some background research on the topic to come up with ideas. Some useful links to check out if you're interested:

    - [Sentiment Analysis of COVID-19 Vaccine Tweets Using Machine Learning](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3869531)
    - [Tweet Topics and Sentiments Relating to COVID-19 Vaccination Among Australian Twitter Users](https://pubmed.ncbi.nlm.nih.gov/33886492/)
    - [Insight from NLP Analysis](https://arxiv.org/pdf/2106.04081.pdf)
    - [An analysis of COVID-19 vaccine sentiments and opinions on Twitter](https://www.ijidonline.com/article/S1201-9712(21)00462-8/fulltext)

  

## Written by Vaidehi Bulusu (edited by Akira and Ramisha)

---