## What is Pandas?

Pandas is a Python **library** designed for **data manipulation and analysis**, especially of tabular data, like an SQL table or an Excel spreadsheet. That includes things like:
* Time series data
* Arbitrary matrix data with meaningful row and column labels
* Any other form of observational / statistical data sets

![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fmegacoglab%2FevDDX8yw-s.png?alt=media&token=8b81e706-a68e-40ac-a3aa-4bccd403fd45)

### Example / motivating use cases

Back to last week's demo!

## Importing the pandas library (getting started)

### What is a library?

You can think of a library as a **collection of functions and data structures**. You *import* a library (or subsets of it) into your program / notebook so you have access to special functions or data structures in your program.

You are already using Python's standard library, which includes built-in functions like `print()`, and built-in data structures like `str` and `dict`. Every time you fire up Python, these are "imported" into your program in the background.

As you advance in your programming career, you will often find that the (sub)problems you want to solve are ones that others have already tackled: they wrote a collection of functions and/or data structures that solves those problems really well, and saved that collection as a library that others can use. Take advantage of this!

### You should learn how to read documentation for libraries

You should have handy access to (and know how to use):
- Docs for "ground truth"
- Some collection of examples for reference.

The pandas website is a decent place to start: https://pandas.pydata.org/

This "cheat sheet" is also a really helpful guide to more common operations that you may run into later: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

There are also many blogs that are helpful, like towardsdatascience.com.

The cool thing about pandas and data analysis in python is that many people share notebooks that you can inspect / learn from / adapt code for your own projects (just like mine!).

Learning how to use libraries is training for learning to code in teams, using code from others. Basically nobody writes anything all from scratch, unless they are trying to *really* **REALLY** learn something deeply.

### "importing" a library: mechanics

Here's what it looks like to import a library and use it, first conceptually with a "fake" library, and then with the pandas library:

![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fmegacoglab%2F0P_oLpCNzy.png?alt=media&token=0908b80b-a761-4588-85b5-7c87bee6bf0e)

We often want to import libraries with `as`.

The name after `as` is sort of like a variable name. We usually do this if the library name is clunky, or if it might conflict with variable names we want to use.

For pandas, by convention, people usually import it `as pd`.

![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fmegacoglab%2FLJ66aL66zf.png?alt=media&token=e0ae487c-910f-48bc-a8f2-ac42424009a9)
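
The same mechanics work for any library. Here's a minimal sketch using the standard library's `statistics` module instead of pandas. The alias `stats` and the numbers are my own made-up choices for illustration, not part of the demo.

```python
# "statistics" ships with Python; "stats" is just a shorter name we pick
import statistics as stats

# we can also import only a subset of a library
from statistics import median

credits = [3, 3, 1, 4]            # made-up numbers for illustration

average = stats.mean(credits)     # reach the function through the alias
middle = median(credits)          # the imported function is used directly

print(average)   # 2.75
print(middle)    # 3.0
```

`import pandas as pd` follows the same pattern: after it runs, everything in pandas is reached through `pd.`.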

Let's do that quickly to illustrate.

In [15]:
# import the pandas library, give it the name pd for easier access
import pandas as pd

In [None]:
# test here: type "pd." and press Tab to browse what the library offers
pd.__version__

## The core of Pandas: The dataframe data structure

We've so far progressed from single-item data structures (`str`, `int`, `float`) to "basic" collections (`list`, `dict`).

Now we will learn about the `dataframe`, which has:
* nice properties of both lists (*orderable, indexable*) and dictionaries (can *retrieve things quickly by key, store associated values*)
* and other properties and *built-in algorithms and methods* that are useful for data analysis (e.g., summarizing, grouping, statistics, etc.)

Remember: **data structures and algorithms go hand in hand**: people made dataframes (and the associated pandas library) so we can do particular kinds of algorithms more easily.

Dataframes are basically like smart spreadsheets that Python can read/write.

![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fmegacoglab%2Fpf7jrFPGel.png?alt=media&token=d9992446-f30f-436b-9102-fa2baab5f3b0)

The data is in rows and columns. Each column in a pandas dataframe is a special data structure called a `Series`.

More [here](https://www.geeksforgeeks.org/python-pandas-dataframe/)
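
To make the dataframe / series relationship concrete, here's a small sketch. The two-row table `toy` is made up for illustration (it only borrows two column names from the course file).

```python
import pandas as pd

# made-up two-row table, just to look at the types involved
toy = pd.DataFrame({"Code": ["INST126", "INST201"], "Credits": [3.0, 3.0]})

code_column = toy["Code"]            # selecting one column by its name

print(type(toy).__name__)            # DataFrame
print(type(code_column).__name__)    # Series
print(code_column.name)              # Code
```

So a dataframe can be thought of as a bundle of named series that share the same row labels.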

#### Dataframes combine the best characteristics of lists and dictionaries, and more!

- Can sort
- Can access data by key
- Can also reindex easily!
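
A rough sketch of those three properties on a tiny made-up table. I'm reading "reindex" here as "choose a different column to serve as the row labels", which is what `set_index` does; that reading is my assumption.

```python
import pandas as pd

# made-up table, deliberately out of order
toy = pd.DataFrame({
    "Code": ["INST201", "INST126", "INST311"],
    "Credits": [3.0, 3.0, 3.0],
})

# list-like: rows have an order, so we can sort and grab by position
by_code = toy.sort_values(by="Code")
first_row = by_code.iloc[0]

# dictionary-like: look a column up by its key
codes = toy["Code"]

# reindex: use the Code column as the row labels, then look a row up by label
indexed = toy.set_index("Code")
inst311_credits = indexed.loc["INST311", "Credits"]

print(first_row["Code"])     # INST126
print(inst311_credits)       # 3.0
```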

In [18]:
# integrated practice! how do we specify directions to the INST courses file?
fpath = '../resources/INST courses.csv'
df = pd.read_csv(fpath)
df.head(10)

Unnamed: 0,Code,Title,Description,Prereqs,Credits
0,INST126,Introduction to Programming for Information Sc...,An introduction to computer programming for st...,Minimum grade of C- in MATH115; or must have m...,3.0
1,INST201,Introduction to Information Science,Examining the effects of new information techn...,,3.0
2,INST311,Information Organization,"Examines the theories, concepts, and principle...",Must have completed or be concurrently enrolle...,3.0
3,INST314,Statistics for Information Science,Basic concepts in statistics including measure...,Must have completed or be concurrently enrolle...,3.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0
8,INST352,Information User Needs and Assessment,"Focuses on use of information by individuals, ...",1 course with a minimum grade of C- from (INST...,3.0
9,INST354,Decision-Making for Information Science,Examines the use of information in organizatio...,INST314.,3.0


In [20]:
# show me the "columns"
df.columns

Index(['Code', 'Title', 'Description', 'Prereqs', 'Credits'], dtype='object')

In [24]:
# get the code column
df['Code']

0      INST126
1      INST201
2      INST311
3      INST314
4      INST326
5      INST327
6      INST335
7      INST346
8      INST352
9      INST354
10     INST362
11     INST377
12    INST408Y
13    INST408Z
14     INST414
15     INST447
16     INST462
17     INST466
18     INST490
19     INST604
20     INST612
21     INST614
22     INST616
23     INST622
24     INST627
25     INST630
26     INST652
27     INST702
28     INST709
29    INST728G
30    INST728V
31     INST733
32     INST737
33     INST741
34     INST742
35     INST746
36     INST762
37     INST767
38     INST776
39     INST785
40     INST794
Name: Code, dtype: object

In [25]:
# find the courses that are 3 credits
df[df['Credits'] == 3.0]

Unnamed: 0,Code,Title,Description,Prereqs,Credits
0,INST126,Introduction to Programming for Information Sc...,An introduction to computer programming for st...,Minimum grade of C- in MATH115; or must have m...,3.0
1,INST201,Introduction to Information Science,Examining the effects of new information techn...,,3.0
2,INST311,Information Organization,"Examines the theories, concepts, and principle...",Must have completed or be concurrently enrolle...,3.0
3,INST314,Statistics for Information Science,Basic concepts in statistics including measure...,Must have completed or be concurrently enrolle...,3.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0
8,INST352,Information User Needs and Assessment,"Focuses on use of information by individuals, ...",1 course with a minimum grade of C- from (INST...,3.0
9,INST354,Decision-Making for Information Science,Examines the use of information in organizatio...,INST314.,3.0


In [26]:
# find all courses where the title contains the word introduction
df[df['Title'].str.contains("Introduction")]

Unnamed: 0,Code,Title,Description,Prereqs,Credits
0,INST126,Introduction to Programming for Information Sc...,An introduction to computer programming for st...,Minimum grade of C- in MATH115; or must have m...,3.0
1,INST201,Introduction to Information Science,Examining the effects of new information techn...,,3.0
16,INST462,Introduction to Data Visualization,"Exploration of the theories, methods, and tech...",INST314.,3.0
19,INST604,Introduction to Archives and Digital Curation,"Overview of the principles, practices, and app...",,3.0
25,INST630,Introduction to Programming for the Informatio...,An introduction to computer programming intend...,,3.0
32,INST737,Introduction to Data Science,An exploration of some of the best and most ge...,"INST627; and (LBSC690, LBSC671, or INFM603). O...",3.0


In [27]:
df.head(10) # show me the top 10 rows in the dataframe

Unnamed: 0,Code,Title,Description,Prereqs,Credits
0,INST126,Introduction to Programming for Information Sc...,An introduction to computer programming for st...,Minimum grade of C- in MATH115; or must have m...,3.0
1,INST201,Introduction to Information Science,Examining the effects of new information techn...,,3.0
2,INST311,Information Organization,"Examines the theories, concepts, and principle...",Must have completed or be concurrently enrolle...,3.0
3,INST314,Statistics for Information Science,Basic concepts in statistics including measure...,Must have completed or be concurrently enrolle...,3.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0
8,INST352,Information User Needs and Assessment,"Focuses on use of information by individuals, ...",1 course with a minimum grade of C- from (INST...,3.0
9,INST354,Decision-Making for Information Science,Examines the use of information in organizatio...,INST314.,3.0


## Common operations (basic)

Let's go over some common operations with dataframes. This will overlap with your PCE, mostly Q1-5 and Q8.

### Constructing a dataframe

#### From other data structures (e.g., lists, dictionaries)

We seldom use this at the start (usually we import data from an external file, like a `.csv` file, into a dataframe).

But I do use this frequently when I'm creating new dataframes for analysis from existing data(frames). Might not be the best pattern to emulate (but it works for me!): a lot of what I do could probably be done more elegantly with proper use of `.groupby()` and `.apply()` (more on this next week).

In [30]:
num = "5"
num_int = int(num)

In [28]:
basic_data = [
    {'name': 'Joel', 'role': 'instructor'},
    {'name': 'Sarah', 'role': 'UTA'}
]
example_df = pd.DataFrame(basic_data)

In [29]:
example_df

Unnamed: 0,name,role
0,Joel,instructor
1,Sarah,UTA


In [None]:
example_df.sort_values(by="name", ascending=False)

Unnamed: 0,name,role
1,Sarah,UTA
0,Joel,instructor
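
The cells above build the dataframe from a list of dictionaries (one dictionary per row). As a sketch of an alternative, pandas also accepts a dictionary of lists (one list per column). The variable names below are mine.

```python
import pandas as pd

# list of dictionaries: one dictionary per row
rows = [
    {"name": "Joel", "role": "instructor"},
    {"name": "Sarah", "role": "UTA"},
]
from_rows = pd.DataFrame(rows)

# dictionary of lists: one list per column
columns = {"name": ["Joel", "Sarah"], "role": ["instructor", "UTA"]}
from_columns = pd.DataFrame(columns)

# both routes should give the same table
print(from_rows.equals(from_columns))
```

Which one to use mostly depends on the shape your data is already in.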


#### From (external) data files

Most frequently this is done with `.read_csv()`, but there are many other common formats, such as `json`. See [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for a full listing.

CSV stands for comma-separated values.

It is commonly used because it is technically *plain text*. This means any program that can read a string can read the file and make sense of it. Not so with Excel files!

In [None]:
with open(fpath, 'r') as f:  # the with block closes the file for us
    flines = f.readlines()

In [None]:
# naive parsing: split each line at every comma
for line in flines:
    elements = line.split(",")
    print(elements)

['Code', 'Title', 'Description', 'Prereqs', 'Credits\n']
['INST126', 'Introduction to Programming for Information Science', '"An introduction to computer programming for students with very limited or no previous programming experience. Topics include fundamental programming concepts such as variables', ' data types', ' assignments', ' arrays', ' conditionals', ' loops', ' functions', ' and I/O operations."', 'Minimum grade of C- in MATH115; or must have math eligibility of MATH140 or higher; or permission of instructor.', '3\n']
['INST201', 'Introduction to Information Science', '"Examining the effects of new information technologies on how we conduct business', ' interact with friends', ' and go through our daily lives. Understanding how technical and social factors have influenced the evolution of information society. Evaluating the transformative power of information in education', ' policy', ' and entertainment', ' and the dark side of these changes."', 'None', '3\n']
['INST311', '
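
Notice in the output above that the quoted descriptions get chopped apart at every comma inside them. Here's a small sketch of why a real CSV parser is worth using: Python's built-in `csv` module respects the quotes. The two-line `sample` text is made up to mimic the shape of the course file.

```python
import csv
import io

# two made-up lines in the same shape as the course file
sample = (
    'Code,Title,Description\n'
    'INST126,Intro to Programming,"Variables, loops, and functions."\n'
)

# naive approach: split at every comma, even the ones inside quotes
naive = sample.splitlines()[1].split(",")

# csv.reader treats a quoted field as a single value
rows = list(csv.reader(io.StringIO(sample)))

print(len(naive))     # 5 pieces
print(len(rows[1]))   # 3 fields
print(rows[1][2])     # Variables, loops, and functions.
```

`pd.read_csv()` handles quoted fields for us in the same spirit, which is one reason we let it do the parsing.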

In [31]:
df = pd.read_csv(fpath) # needs a path to a csv file
df

Unnamed: 0,Code,Title,Description,Prereqs,Credits
0,INST126,Introduction to Programming for Information Sc...,An introduction to computer programming for st...,Minimum grade of C- in MATH115; or must have m...,3.0
1,INST201,Introduction to Information Science,Examining the effects of new information techn...,,3.0
2,INST311,Information Organization,"Examines the theories, concepts, and principle...",Must have completed or be concurrently enrolle...,3.0
3,INST314,Statistics for Information Science,Basic concepts in statistics including measure...,Must have completed or be concurrently enrolle...,3.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0
8,INST352,Information User Needs and Assessment,"Focuses on use of information by individuals, ...",1 course with a minimum grade of C- from (INST...,3.0
9,INST354,Decision-Making for Information Science,Examines the use of information in organizatio...,INST314.,3.0


### Inspecting your dataframe

Common operations:
- summarizing
- filtering / accessing
- sorting

#### Summarizing

With:
- `.head()`
- `.describe()`
- various stats

In [36]:
# we have a dataframe named df
# df has a method called head
# can optionally pass in a parameter to tell how many rows from the top to return
df.head(10) # show the top 10

Unnamed: 0,Code,Title,Description,Prereqs,Credits
0,INST126,Introduction to Programming for Information Sc...,An introduction to computer programming for st...,Minimum grade of C- in MATH115; or must have m...,3.0
1,INST201,Introduction to Information Science,Examining the effects of new information techn...,,3.0
2,INST311,Information Organization,"Examines the theories, concepts, and principle...",Must have completed or be concurrently enrolle...,3.0
3,INST314,Statistics for Information Science,Basic concepts in statistics including measure...,Must have completed or be concurrently enrolle...,3.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0
8,INST352,Information User Needs and Assessment,"Focuses on use of information by individuals, ...",1 course with a minimum grade of C- from (INST...,3.0
9,INST354,Decision-Making for Information Science,Examines the use of information in organizatio...,INST314.,3.0


In [None]:
import random
df['random_number'] = [c + random.randint(0,5) for c in df['Credits']]

In [None]:
df.describe()

Unnamed: 0,Credits,random_number
count,36.0,36.0
mean,3.0,5.666667
std,0.0,1.621287
min,3.0,3.0
25%,3.0,4.0
50%,3.0,6.0
75%,3.0,7.0
max,3.0,8.0


In [None]:
df['random_number'].mean()

5.666666666666667

In [None]:
df.head(10)

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number
0,INST126,Introduction to Programming for Information Sc...,An introduction to computer programming for st...,Minimum grade of C- in MATH115; or must have m...,3.0,8.0
1,INST201,Introduction to Information Science,Examining the effects of new information techn...,,3.0,6.0
2,INST311,Information Organization,"Examines the theories, concepts, and principle...",Must have completed or be concurrently enrolle...,3.0,7.0
3,INST314,Statistics for Information Science,Basic concepts in statistics including measure...,Must have completed or be concurrently enrolle...,3.0,4.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0,4.0
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0,7.0
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0,7.0
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0,8.0
8,INST352,Information User Needs and Assessment,"Focuses on use of information by individuals, ...",1 course with a minimum grade of C- from (INST...,3.0,8.0
9,INST354,Decision-Making for Information Science,Examines the use of information in organizatio...,INST314.,3.0,6.0


#### Subsetting / getting/accessing parts of our dataframe

The most basic operation is getting a specific column. It looks like the way we index into lists or dictionaries.

In [None]:
df['Code']

0      INST126
1      INST201
2      INST311
3      INST314
4      INST326
5      INST327
6      INST335
7      INST346
8      INST352
9      INST354
10     INST362
11     INST377
12    INST408Y
13    INST408Z
14     INST414
15     INST447
16     INST462
17     INST466
18     INST490
19     INST604
20     INST612
21     INST614
22     INST616
23     INST622
24     INST627
25     INST630
26     INST652
27     INST702
28     INST709
29    INST728G
30    INST728V
31     INST733
32     INST737
33     INST741
34     INST742
35     INST746
36     INST762
37     INST767
38     INST776
39     INST785
40     INST794
Name: Code, dtype: object

Let's say you want a particular statistic for only one column. You can do this by accessing the series, and asking for a specific statistic.

In [None]:
df['random_number'].median()

6.0
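
Other single statistics work the same way. A quick sketch on a made-up series (the numbers are not from the course data):

```python
import pandas as pd

# made-up values standing in for one column of a dataframe
numbers = pd.Series([3, 4, 6, 7, 8], name="random_number")

print(numbers.mean())                  # 5.6
print(numbers.median())                # 6.0
print(numbers.min(), numbers.max())    # 3 8
```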

But we sometimes also want to get **subsets** of the data, depending on one or more column values.

We can do this with indexing notation (I use this because I'm used to it).

In [None]:
df[df['random_number'] <= 5] # get me all the rows where the value of the column random_number is less than or equal to 5

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number
1,INST201,Introduction to Information Science,Examining the effects of new information techn...,,3.0,4.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0,5.0
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0,5.0
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0,4.0
10,INST362,User-Centered Design,Introduction to human-computer interaction (HC...,1 course with a minimum grade of C- from (INST...,3.0,4.0
14,INST414,Data Science Techniques,An exploration of how to extract insights from...,INST314.,3.0,3.0
22,INST616,Open Source Intelligence,An introduction to Open Source Intelligence (O...,,3.0,4.0
23,INST622,Information and Universal Usability,Information services and technologies to provi...,,3.0,3.0
24,INST627,Data Analytics for Information Professionals,"Skills and knowledge needed to craft datasets,...",,3.0,3.0
26,INST652,Design Thinking and Youth,Methods of design thinking specifically within...,,3.0,5.0
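
It can help to see the two steps hiding inside that one line. In this sketch (toy data, made-up numbers), the comparison first produces a series of `True` / `False` values, and indexing with that series keeps only the `True` rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "Code": ["INST126", "INST201", "INST311"],
    "random_number": [8, 4, 5],      # made-up values
})

mask = toy["random_number"] <= 5     # step 1: one True/False per row
subset = toy[mask]                   # step 2: keep the rows where mask is True

print(mask.tolist())                 # [False, True, True]
print(subset["Code"].tolist())       # ['INST201', 'INST311']
```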


In [None]:
def code_to_level(code):
    return f'{code[4]}00'

df['Level'] = df['Code'].apply(code_to_level)
df.head(10)

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number,Level
0,INST126,Introduction to Programming for Information Sc...,An introduction to computer programming for st...,Minimum grade of C- in MATH115; or must have m...,3.0,7.0,100
1,INST201,Introduction to Information Science,Examining the effects of new information techn...,,3.0,4.0,200
2,INST311,Information Organization,"Examines the theories, concepts, and principle...",Must have completed or be concurrently enrolle...,3.0,7.0,300
3,INST314,Statistics for Information Science,Basic concepts in statistics including measure...,Must have completed or be concurrently enrolle...,3.0,7.0,300
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0,5.0,300
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0,7.0,300
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0,5.0,300
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0,4.0,300
8,INST352,Information User Needs and Assessment,"Focuses on use of information by individuals, ...",1 course with a minimum grade of C- from (INST...,3.0,7.0,300
9,INST354,Decision-Making for Information Science,Examines the use of information in organizatio...,INST314.,3.0,7.0,300
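
What `.apply()` is doing in the cell above, roughly: it calls the function once for every value in the column and collects the results into a new series. A sketch with three course codes taken from the listing:

```python
import pandas as pd

def code_to_level(code):
    # the character at position 4 is the first digit of the course number
    return f"{code[4]}00"

codes = pd.Series(["INST126", "INST408Y", "INST728G"])

levels = codes.apply(code_to_level)                 # called once per value
levels_loop = [code_to_level(c) for c in codes]     # same idea as a plain loop

print(levels.tolist())    # ['100', '400', '700']
```

Note that the levels are strings, not numbers, so any later comparison on them compares text.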


In [None]:
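# Level holds strings, so >= '600' compares text; missing prereqs are NaN unless already filled with "None"
df[(df['Level'] >= '600') & (df['Prereqs'].isna() | (df['Prereqs'] == "None"))]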

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number,Level
19,INST604,Introduction to Archives and Digital Curation,"Overview of the principles, practices, and app...",,3.0,6.0,600
20,INST612,Information Policy,"Nature, structure, development and application...",,3.0,7.0,600
21,INST614,Literacy and Inclusion,The educational and psychological dimensions o...,,3.0,6.0,600
22,INST616,Open Source Intelligence,An introduction to Open Source Intelligence (O...,,3.0,4.0,600
23,INST622,Information and Universal Usability,Information services and technologies to provi...,,3.0,3.0,600
24,INST627,Data Analytics for Information Professionals,"Skills and knowledge needed to craft datasets,...",,3.0,3.0,600
25,INST630,Introduction to Programming for the Informatio...,An introduction to computer programming intend...,,3.0,7.0,600
26,INST652,Design Thinking and Youth,Methods of design thinking specifically within...,,3.0,5.0,600


In [None]:
df[df['Code'].str.startswith("INST3")] # get all the rows where the value of the code column starts with the string INST3

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number,Level
2,INST311,Information Organization,"Examines the theories, concepts, and principle...",Must have completed or be concurrently enrolle...,3.0,7.0,300
3,INST314,Statistics for Information Science,Basic concepts in statistics including measure...,Must have completed or be concurrently enrolle...,3.0,7.0,300
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0,5.0,300
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0,7.0,300
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0,5.0,300
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0,4.0,300
8,INST352,Information User Needs and Assessment,"Focuses on use of information by individuals, ...",1 course with a minimum grade of C- from (INST...,3.0,7.0,300
9,INST354,Decision-Making for Information Science,Examines the use of information in organizatio...,INST314.,3.0,7.0,300
10,INST362,User-Centered Design,Introduction to human-computer interaction (HC...,1 course with a minimum grade of C- from (INST...,3.0,4.0,300
11,INST377,Dynamic Web Applications,An exploration of the basic methods and tools ...,INST327.,3.0,8.0,300


In [None]:
df[df['Title'].str.contains("Design")] # get all the rows where the value of the Title column contains the word Design

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0,7.0
10,INST362,User-Centered Design,Introduction to human-computer interaction (HC...,1 course with a minimum grade of C- from (INST...,3.0,7.0
26,INST652,Design Thinking and Youth,Methods of design thinking specifically within...,,3.0,6.0
31,INST733,Database Design,Principles of user-oriented database design. ...,"LBSC690, LBSC671, or INFM603; or permission of...",3.0,8.0


In [None]:
# get all the 300 level design courses?
df[(df['Title'].str.contains("Design")) & (df['Code'].str.startswith("INST3"))]

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0,7.0
10,INST362,User-Centered Design,Introduction to human-computer interaction (HC...,1 course with a minimum grade of C- from (INST...,3.0,7.0
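
A sketch of how conditions combine, on a made-up four-row table. With pandas we use `&` (and), `|` (or), and `~` (not) instead of Python's `and` / `or` / `not`. When a comparison like `<=` is written inline, each condition needs its own parentheses, because `&` and `|` bind more tightly than comparison operators.

```python
import pandas as pd

toy = pd.DataFrame({
    "Code": ["INST327", "INST362", "INST652", "INST126"],
    "Title": [
        "Database Design and Modeling",
        "User-Centered Design",
        "Design Thinking and Youth",
        "Introduction to Programming",
    ],
})

is_design = toy["Title"].str.contains("Design")
is_300 = toy["Code"].str.startswith("INST3")

both = toy[is_design & is_300]       # design AND 300 level
either = toy[is_design | is_300]     # design OR 300 level
not_design = toy[~is_design]         # NOT design

print(both["Code"].tolist())         # ['INST327', 'INST362']
print(either["Code"].tolist())       # ['INST327', 'INST362', 'INST652']
print(not_design["Code"].tolist())   # ['INST126']
```

Giving each condition a name, as here, can make longer queries easier to read.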


In [None]:
# get all the courses that have a "minimum grade" prereq

# have to first clean the data: fill in the missing values
df['Prereqs'] = df['Prereqs'].fillna("None")

# now do the query
df[df['Prereqs'].str.contains("minimum grade")]

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number
3,INST314,Statistics for Information Science,Basic concepts in statistics including measure...,Must have completed or be concurrently enrolle...,3.0,4.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0,4.0
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0,7.0
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0,7.0
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0,8.0
8,INST352,Information User Needs and Assessment,"Focuses on use of information by individuals, ...",1 course with a minimum grade of C- from (INST...,3.0,8.0
10,INST362,User-Centered Design,Introduction to human-computer interaction (HC...,1 course with a minimum grade of C- from (INST...,3.0,7.0
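
As a hedged alternative to filling in the missing values first, `.str.contains()` accepts an `na=` argument that says what to report for missing entries, and a `case=` argument for case-insensitive matching. The four prereq strings below are shortened, made-up stand-ins for the real column.

```python
import pandas as pd

# shortened stand-ins for the Prereqs column; None marks a missing value
prereqs = pd.Series([
    "1 course with a minimum grade of C-",
    None,
    "INST314.",
    "Minimum grade of C- in MATH115",
])

# na=False: treat a missing entry as "does not contain"
safe = prereqs.str.contains("minimum grade", na=False)

# case=False: also match "Minimum grade" with a capital M
any_case = prereqs.str.contains("minimum grade", case=False, na=False)

print(safe.tolist())       # [True, False, False, False]
print(any_case.tolist())   # [True, False, False, True]
```

The second query is worth a look: the case-sensitive search above skips INST126, whose prereq starts with a capital "Minimum grade".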


In [None]:
for prereq in df[df['Prereqs'].str.contains("minimum grade")]['Prereqs']:
    print(prereq)

Must have completed or be concurrently enrolled in INST201; or must have completed or be concurrently enrolled in INST301. And minimum grade of C- in INST201 and INST301; and MATH115; and STAT100; and minimum grade of C- in MATH115 and STAT100.
1 course with a minimum grade of C- from (INST126, CMSC106); and must have completed or be concurrently enrolled in INST201 or INST301. And minimum grade of C- in INST201; or minimum grade of C- in INST301.
1 course with a minimum grade of C- from (CMSC106, CMSC122, INST126); and must have completed or be concurrently enrolled in INST201 or INST301; and minimum grade of C- in INST201 and INST301.
1 course with a minimum grade of C- from (INST201, INST301); and minimum grade of C- in PSYC100.
1 course with a minimum grade of C- from (INST201, INST301); and 1 course with a minimum grade of C- from (INST326, CMSC131); and minimum grade of C- in INST327.
1 course with a minimum grade of C- from (INST201, INST301); and minimum grade of C- in INST311.

#### Reshaping

The most basic reshaping operation is sorting.

We will discuss more advanced operations, like transposing, next week.

In [None]:
df.sort_values(by="random_number") # sort in ascending order by the random_number column

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number
40,INST794,Capstone in Youth Experience,"Through a supervised project, to synthesize de...","INST650, INST651, and INST652; or permission o...",3.0,3.0
35,INST746,Digitization of Legacy Holdings,Through hands on exercises and real-world proj...,INST604.,3.0,3.0
33,INST741,Social Computing Technologies and Applications,Tools and techniques for developing and config...,INFM603 and INFM605; or (LBSC602 and LBSC671);...,3.0,3.0
22,INST616,Open Source Intelligence,An introduction to Open Source Intelligence (O...,,3.0,3.0
21,INST614,Literacy and Inclusion,The educational and psychological dimensions o...,,3.0,3.0
16,INST462,Introduction to Data Visualization,"Exploration of the theories, methods, and tech...",INST314.,3.0,3.0
39,INST785,"Documentation, Collection, and Appraisal of Re...",Development of documentation strategies and pl...,INST604; or permission of instructor.,3.0,4.0
19,INST604,Introduction to Archives and Digital Curation,"Overview of the principles, practices, and app...",,3.0,4.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0,4.0
3,INST314,Statistics for Information Science,Basic concepts in statistics including measure...,Must have completed or be concurrently enrolle...,3.0,4.0


In [None]:
# sort by the code column, but in descending order
df.sort_values(by="Code", ascending=False)

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number
40,INST794,Capstone in Youth Experience,"Through a supervised project, to synthesize de...","INST650, INST651, and INST652; or permission o...",3.0,3.0
39,INST785,"Documentation, Collection, and Appraisal of Re...",Development of documentation strategies and pl...,INST604; or permission of instructor.,3.0,4.0
38,INST776,HCIM CAPSTONE PROJECT,The opportunity to apply the skills learned th...,INST775; or permission of instructor.,3.0,5.0
37,INST767,Big Data Infrastructure,Principles and techniques of data science and ...,INST737; or permission of instructor.,3.0,5.0
36,INST762,Visual Analytics,Visual analytics is the use of interactive vis...,INFM603 or INST630; or permission of instructor.,3.0,7.0
35,INST746,Digitization of Legacy Holdings,Through hands on exercises and real-world proj...,INST604.,3.0,3.0
34,INST742,Implementing Digital Curation,Management of and technology for application o...,INST604; or permission of instructor.,3.0,5.0
33,INST741,Social Computing Technologies and Applications,Tools and techniques for developing and config...,INFM603 and INFM605; or (LBSC602 and LBSC671);...,3.0,3.0
32,INST737,Introduction to Data Science,An exploration of some of the best and most ge...,"INST627; and (LBSC690, LBSC671, or INFM603). O...",3.0,7.0
31,INST733,Database Design,Principles of user-oriented database design. ...,"LBSC690, LBSC671, or INFM603; or permission of...",3.0,8.0


In [None]:
# save the sorted copy in a new variable; df itself is not changed
sorted_by_code = df.sort_values(by="Code", ascending=False)

In [None]:
# the original dataframe is still in its original order
df.head()

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number
0,INST126,Introduction to Programming for Information Sc...,An introduction to computer programming for st...,Minimum grade of C- in MATH115; or must have m...,3.0,8.0
1,INST201,Introduction to Information Science,Examining the effects of new information techn...,,3.0,6.0
2,INST311,Information Organization,"Examines the theories, concepts, and principle...",Must have completed or be concurrently enrolle...,3.0,7.0
3,INST314,Statistics for Information Science,Basic concepts in statistics including measure...,Must have completed or be concurrently enrolle...,3.0,4.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0,4.0


In [None]:
# the saved copy is sorted by Code, in descending order
sorted_by_code.head()

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number
40,INST794,Capstone in Youth Experience,"Through a supervised project, to synthesize de...","INST650, INST651, and INST652; or permission o...",3.0,3.0
39,INST785,"Documentation, Collection, and Appraisal of Re...",Development of documentation strategies and pl...,INST604; or permission of instructor.,3.0,4.0
38,INST776,HCIM CAPSTONE PROJECT,The opportunity to apply the skills learned th...,INST775; or permission of instructor.,3.0,5.0
37,INST767,Big Data Infrastructure,Principles and techniques of data science and ...,INST737; or permission of instructor.,3.0,5.0
36,INST762,Visual Analytics,Visual analytics is the use of interactive vis...,INFM603 or INST630; or permission of instructor.,3.0,7.0


In [None]:
# sort by the Prereqs column and modify df itself (inplace=True returns None, not a copy)
df.sort_values(by="Prereqs", inplace=True)
df

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0,7.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0,4.0
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0,8.0
8,INST352,Information User Needs and Assessment,"Focuses on use of information by individuals, ...",1 course with a minimum grade of C- from (INST...,3.0,8.0
10,INST362,User-Centered Design,Introduction to human-computer interaction (HC...,1 course with a minimum grade of C- from (INST...,3.0,7.0
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0,7.0
33,INST741,Social Computing Technologies and Applications,Tools and techniques for developing and config...,INFM603 and INFM605; or (LBSC602 and LBSC671);...,3.0,3.0
36,INST762,Visual Analytics,Visual analytics is the use of interactive vis...,INFM603 or INST630; or permission of instructor.,3.0,7.0
17,INST466,"Technology, Culture, and Society","Individual, cultural, and societal outcomes as...",INST201.,3.0,8.0
9,INST354,Decision-Making for Information Science,Examines the use of information in organizatio...,INST314.,3.0,6.0


In [None]:
# sort by Prereqs first, then use Code to break ties between rows with the same Prereqs
df.sort_values(by=["Prereqs", "Code"])

Unnamed: 0,Code,Title,Description,Prereqs,Credits,random_number
5,INST327,Database Design and Modeling,"Introduction to databases, the relational mode...",1 course with a minimum grade of C- from (CMSC...,3.0,7.0
4,INST326,Object-Oriented Programming for Information Sc...,"An introduction to programming, emphasizing un...",1 course with a minimum grade of C- from (INST...,3.0,4.0
7,INST346,Technologies Infrastructure and Architecture,Examines the basic concepts of local and wide-...,1 course with a minimum grade of C- from (INST...,3.0,8.0
8,INST352,Information User Needs and Assessment,"Focuses on use of information by individuals, ...",1 course with a minimum grade of C- from (INST...,3.0,8.0
10,INST362,User-Centered Design,Introduction to human-computer interaction (HC...,1 course with a minimum grade of C- from (INST...,3.0,7.0
6,INST335,Teams and Organizations,"Team development and the principles, methods a...",1 course with a minimum grade of C- from (INST...,3.0,7.0
33,INST741,Social Computing Technologies and Applications,Tools and techniques for developing and config...,INFM603 and INFM605; or (LBSC602 and LBSC671);...,3.0,3.0
36,INST762,Visual Analytics,Visual analytics is the use of interactive vis...,INFM603 or INST630; or permission of instructor.,3.0,7.0
17,INST466,"Technology, Culture, and Society","Individual, cultural, and societal outcomes as...",INST201.,3.0,8.0
9,INST354,Decision-Making for Information Science,Examines the use of information in organizatio...,INST314.,3.0,6.0


### Aside: dataframes are (mostly) immutable

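pandas wants you to treat dataframes as if they were immutable: by default, most methods that modify a dataframe (like `sort_values`) return a modified copy, just like string methods do, rather than changing the dataframe itself. That is why `df.head()` above still showed the original row order after we called `df.sort_values(by="Code", ascending=False)`. The "mostly" matters, though: dataframes are not truly immutable, and some operations, like assigning to a column with `df["col"] = ...`, do change the dataframe directly.

You *can* get around the copy-by-default behavior if you want, by passing an `inplace=True` argument to many of these methods, as we did when sorting by `Prereqs` above. In that case the method changes the dataframe itself and returns `None`.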
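
Here is a minimal sketch of the difference between the two behaviors. The `courses` dataframe is a made-up toy example (not the real course catalog we loaded earlier), so treat the names and values as assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the course catalog used above.
courses = pd.DataFrame({
    "Code": ["INST201", "INST126", "INST314"],
    "Credits": [3.0, 3.0, 3.0],
})

# Default behavior: sort_values returns a new, sorted dataframe.
sorted_copy = courses.sort_values(by="Code")
print(courses["Code"].tolist())      # the original row order is unchanged
print(sorted_copy["Code"].tolist())  # the copy is sorted by Code

# With inplace=True, the method reorders `courses` itself and returns None.
result = courses.sort_values(by="Code", ascending=False, inplace=True)
print(result)
print(courses["Code"].tolist())      # `courses` is now sorted, descending
```

If you want to keep the sorted version under the same name without `inplace=True`, you can also reassign the result, as in `courses = courses.sort_values(by="Code")`.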