# Introduction to Pandas and Series Data

## The Series Data Structure

The Series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary: the items are stored in order, and there are labels with which you can retrieve them. An easy way to visualize this is as two columns of data. The first is the special index, a lot like the keys in a dictionary, while the second is your actual data. It is important to note that the data column has a label of its own, which can be retrieved using the .name attribute. This is different from dictionaries and is useful when it comes to merging multiple columns of data.

In [1]:
import pandas as pd

In [2]:
students = ['Alice', 'Jack', 'Molly']
pd.Series(students)

0    Alice
1     Jack
2    Molly
dtype: object

In [3]:
numbers = [1,2,3]
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

How None is handled depends on the type of the rest of the data in the series.

In [4]:
students = ['Alice', 'Jack', None]
pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

In [5]:
numbers = [1,2,None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

Notice two things. First, pandas has replaced None with NaN, which is a different value. Second, pandas set the dtype of this series to floating point numbers instead of object or int.

It is important to realize that None and NaN might be used in the same way, but to pandas they are different. None is NOT equivalent to NaN. 

In [6]:
import numpy as np
np.nan == None

False

In [7]:
np.nan == np.nan

False

In [8]:
np.isnan(np.nan)

True

So keep in mind when you see NaN, its meaning is similar to None, but it is a numeric value and treated differently for efficiency reasons. 
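Because equality tests against NaN always come back False, missing values are usually located with a dedicated check instead. A minimal sketch using the `isna` method on a Series like the one above:

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, None])

# Comparing with == never matches NaN, so this finds nothing
found_by_equality = (numbers == np.nan).any()
print(found_by_equality)

# isna() flags each missing entry
missing = numbers.isna()
print(missing.tolist())
```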

Often you have labeled data that you want to manipulate, so creating a Series from a dictionary is common. The index is automatically assigned from the keys of the dictionary that you provided, rather than being just incrementing integers.

In [9]:
student_scores = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}
s = pd.Series(student_scores)
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

Once the series has been created, we can get the index object using the index attribute.

In [10]:
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

In [11]:
students = [('Alice','Brown'),('Jack','White'),('Molly','Green')]
pd.Series(students)

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

In [12]:
# You can also separate your index creation from the data by passing in the index as a list
pd.Series(['Physics', 'Chemistry', 'English'], index = ['Alice', 'Jack', 'Molly'])

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object
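The label of the data column itself, mentioned at the start of this section, can be set in the same constructor call. A small sketch; the name `Class` is chosen here purely for illustration:

```python
import pandas as pd

s = pd.Series(['Physics', 'Chemistry', 'English'],
              index=['Alice', 'Jack', 'Molly'],
              name='Class')   # illustrative name for the data column

print(s.index.tolist())  # the labels
print(s.tolist())        # the data
print(s.name)            # the label of the data column
```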

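What happens if the list of values you pass as the index does not align with the keys of the dictionary used to create the series? Pandas gives priority to the index list: keys that are not in the list are ignored, and index values with no matching key get a missing value.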

In [13]:
student_scores = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}
s = pd.Series(student_scores, index = ['Alice', 'Molly', 'Sam'])
s

Alice    Physics
Molly    English
Sam          NaN
dtype: object
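A quick programmatic check of that result, as a sketch:

```python
import pandas as pd

student_scores = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}
s = pd.Series(student_scores, index=['Alice', 'Molly', 'Sam'])

# Jack is in the dictionary but not in the requested index, so he is left out
print('Jack' in s.index)

# Sam is in the requested index but not in the dictionary, so his value is missing
print(s.isna().tolist())
```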

## Querying a Data Series

A pandas Series can be queried either by the index position or by the index label. If you don't give an index to the series when creating it, the position and the label are effectively the same values. To query by numeric location, starting at zero, use the iloc attribute. To query by the index label, use the loc attribute.

In [14]:
import pandas as pd

In [15]:
students_classes = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English', 'Sam': 'History'}
s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

So for this series, if you wanted to see the fourth entry, you would use the iloc attribute with the parameter 3.

In [16]:
s.iloc[3]

'History'

If you wanted to see what class Molly has, you would use the loc attribute with the parameter 'Molly'.

In [17]:
s.loc['Molly']

'English'

Keep in mind that iloc and loc are attributes, not methods, so you do not use parentheses to query them, but square brackets instead, which is called the indexing operator. 

Pandas tries to make our code a bit more readable and provides a smart syntax using the indexing operator directly on the series itself. For instance, if you pass in an integer parameter, the operator will behave as if you want it to query via the iloc attribute. And if you pass in an object, it will query as if you wanted to use the label-based loc attribute. Note that recent pandas releases emit a FutureWarning for the integer fallback when the index is not made of integers, so s.iloc[3] is the more durable spelling.

In [18]:
s[3]

'History'

In [19]:
s['Molly']

'English'

What happens if your index is a list of integers? This is a bit complicated, and pandas cannot determine automatically whether you intend to query by index position or by index label. So you need to be careful when using the indexing operator on the Series itself. The safer option is to be more explicit and use the iloc or loc attributes directly.

In [20]:
class_code = {99: 'Physics', 100: 'Chemistry', 101: 'English', 102: 'History'}
s = pd.Series(class_code)
s

99       Physics
100    Chemistry
101      English
102      History
dtype: object

If we try to call s[0] we will get a KeyError, because there is no item in the series with an index label of zero. Instead, we have to call iloc explicitly if we want the first item.

In [21]:
s.iloc[0]

'Physics'
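The difference between the three spellings on an integer-labeled series can be seen side by side. A minimal sketch using the same class codes:

```python
import pandas as pd

class_code = {99: 'Physics', 100: 'Chemistry', 101: 'English', 102: 'History'}
s = pd.Series(class_code)

first_by_position = s.iloc[0]   # position 0, regardless of labels
first_by_label = s.loc[99]      # label 99

try:
    s[0]                        # treated as a label lookup, and 0 is not a label
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(first_by_position, first_by_label, lookup_failed)
```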

Let's talk about working with the data. A common task is to consider all of the values inside a series and perform some sort of operation. This could be trying to find a certain number, summarizing the data, or transforming the data in some way.

A typical programmatic approach to this would be to iterate over all the items in the series and invoke the operation one is interested in. For instance, we could create a Series of integers representing student grades and just try to get an average grade.

In [22]:
grades = pd.Series([90,80,70,60])

total = 0
for grade in grades:
    total += grade
print(total/len(grades))

75.0


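This works, but it is slow. Pandas and NumPy support a method of computation called vectorization, in which an operation is applied to the whole array at once rather than element by element in a Python loop. Vectorization works with most of the functions in the NumPy library, including the sum function.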

In [23]:
import numpy as np

total = np.sum(grades)
print(total/len(grades))

75.0
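The Series object also exposes vectorized methods of its own, so the same average can be computed without calling NumPy directly. A short sketch with the same grades:

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# Both methods operate on the whole series at once
total = grades.sum()
average = grades.mean()

print(total)
print(average)
```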


In [24]:
# 10,000 random integers from 0 up to (but not including) 1000
numbers = pd.Series(np.random.randint(0,1000,10000))

numbers.head()

0    745
1    169
2    438
3    967
4    146
dtype: int32

In [25]:
len(numbers)

10000

In [26]:
%%timeit -n 100
total = 0
for number in numbers:
    total += number
total/len(numbers)

673 µs ± 8.09 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [27]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

35.8 µs ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


A related feature in pandas and NumPy is called broadcasting. With broadcasting, you can apply an operation to every value in the series, changing the series. For instance, if we wanted to increase every random value by 2, we could do so quickly using the += operator directly on the Series object.

In [28]:
numbers.head()

0    745
1    169
2    438
3    967
4    146
dtype: int32

In [29]:
numbers += 2

In [30]:
numbers.head()

0    747
1    171
2    440
3    969
4    148
dtype: int32
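Broadcasting is not limited to the in-place += form. Ordinary arithmetic operators broadcast too and return a new Series, leaving the original untouched. A small sketch, using three fixed values as a stand-in for the random series:

```python
import pandas as pd

numbers = pd.Series([745, 169, 438])   # stand-in for the random series

shifted = numbers + 2    # new Series; numbers is unchanged
doubled = numbers * 2    # other arithmetic operators broadcast the same way

print(numbers.tolist())
print(shifted.tolist())
print(doubled.tolist())
```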

The .loc attribute lets you not only modify data in place but also add new data. If the value you pass in as the index doesn't exist, then a new entry is added. Keep in mind that indices can have mixed types.

In [31]:
s = pd.Series([1,2,3])
s.loc['History'] = 102
s

0            1
1            2
2            3
History    102
dtype: int64
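Both behaviors of .loc assignment can be seen in one place. A minimal sketch; the replacement value 10 is arbitrary:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

s.loc[0] = 10            # label 0 exists, so its value is overwritten
s.loc['History'] = 102   # label 'History' does not exist, so a new entry is added

print(s.index.tolist())  # the index now mixes integers and a string
print(s.tolist())
```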

# DataFrame

## DataFrame Data Structure

In [32]:
import pandas as pd

In [33]:
record1 = pd.Series({'Name': 'Alice', 'Class': 'Physics', 'Score': 85})
record2 = pd.Series({'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82})
record3 = pd.Series({'Name': 'Helen', 'Class': 'Biology', 'Score': 90})

In [34]:
df = pd.DataFrame([record1, record2, record3], index = ['school1', 'school2', 'school3'])

df

|  | Name | Class | Score |
|---|---|---|---|
| school1 | Alice | Physics | 85 |
| school2 | Jack | Chemistry | 82 |
| school3 | Helen | Biology | 90 |


We could also have just passed in a list of dictionaries.

In [37]:
students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85}, 
           {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
           {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]

df = pd.DataFrame(students,index = ['school1', 'school2', 'school1'])

df

|  | Name | Class | Score |
|---|---|---|---|
| school1 | Alice | Physics | 85 |
| school2 | Jack | Chemistry | 82 |
| school1 | Helen | Biology | 90 |


Similar to a series, we can extract data using the .iloc and .loc attributes. Because the DataFrame is two-dimensional, passing a single value to the loc indexing operator will return a Series if there is only one row to return.

In [38]:
df.loc['school2']

Name          Jack
Class    Chemistry
Score           82
Name: school2, dtype: object

It is important to remember that the indices and column names along either axis can be non-unique. In this example we have two records for school1 as different rows. If we use a single value with the DataFrame loc attribute, multiple rows of the DataFrame will be returned, not as a new Series but as a new DataFrame.

In [39]:
df.loc['school1']

|  | Name | Class | Score |
|---|---|---|---|
| school1 | Alice | Physics | 85 |
| school1 | Helen | Biology | 90 |
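Since the return type depends on how many rows match, it can be worth checking whether an index is unique before relying on it. A sketch that rebuilds the same DataFrame and inspects both cases:

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

print(df.index.is_unique)                 # False: 'school1' appears twice

one_row = df.loc['school2']               # one matching row
two_rows = df.loc['school1']              # two matching rows
print(type(one_row).__name__)
print(type(two_rows).__name__)
```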


Pandas allows you to quickly select data based on multiple axes. For example, to list just the names of the students from school1, you supply two parameters to loc: the row label and the column name.

In [40]:
df.loc['school1', 'Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [42]:
df.loc['school1']['Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

What would we do if we just wanted to select a single column though? For example, all the names? There are a couple ways. 

One way is to transpose the matrix and then use the loc attribute. 

In [43]:
df.T

|  | school1 | school2 | school1 |
|---|---|---|---|
| Name | Alice | Jack | Helen |
| Class | Physics | Chemistry | Biology |
| Score | 85 | 82 | 90 |


In [44]:
df.T.loc['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

However, since iloc and loc are used for row selection, Pandas reserves the indexing operator directly on the DataFrame for column selection. Columns will always have a name. 

In [45]:
df['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

Since the result of using the indexing operator is either a DataFrame or Series, you can chain operations together. For instance, we can select all of the rows related to school1 using .loc, then project the Name column from just those rows.

In [46]:
df.loc['school1']['Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

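Chaining can cause pandas to return a copy of the DataFrame instead of a view on the DataFrame. For selecting data this isn't a big deal. If you are changing data, though, this can be a source of error, because an assignment made through a chained expression may be applied to the temporary copy rather than to the original DataFrame.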
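When the goal is to change a value, a single .loc call with both the row and column label avoids the intermediate object. A minimal sketch that rebuilds the small DataFrame from earlier; the new score of 95 is an arbitrary illustrative value:

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# One .loc call with row and column labels writes to df itself
df.loc['school2', 'Score'] = 95

print(df.loc['school2', 'Score'])
```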

Here is another approach. As we saw, loc does row selection, and it can take two parameters, the row index and the list of column names. The .loc attribute also supports slicing. 

If we wanted to select all rows, we can use a colon to indicate a full slice from beginning to end. Then we can add the column name or multiple columns in a list and Pandas will bring back only the columns we have asked for. 

In [47]:
df.loc[:,['Name', 'Score']]

|  | Name | Score |
|---|---|---|
| school1 | Alice | 85 |
| school2 | Jack | 82 |
| school1 | Helen | 90 |
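The same selection can also be written by position with .iloc, and .loc accepts label slices on the column axis as well. A small sketch, assuming the column order Name, Class, Score from the DataFrame built earlier:

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

by_label = df.loc[:, ['Name', 'Score']]
by_position = df.iloc[:, [0, 2]]        # columns 0 and 2 are Name and Score
print(by_label.equals(by_position))

# Unlike positional slices, label slices include both endpoints
name_to_class = df.loc[:, 'Name':'Class']
print(name_to_class.columns.tolist())
```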


It is easy to delete data in a Series or a DataFrame, and we can use the drop function to do so. This function takes a single required parameter, which is the index or row label to drop. By default, however, the drop function doesn't actually change the DataFrame. Instead, it returns a copy of the DataFrame with the given rows removed.

In [48]:
df.drop('school1')

|  | Name | Class | Score |
|---|---|---|---|
| school2 | Jack | Chemistry | 82 |


In [49]:
df

|  | Name | Class | Score |
|---|---|---|---|
| school1 | Alice | Physics | 85 |
| school2 | Jack | Chemistry | 82 |
| school1 | Helen | Biology | 90 |


The drop function has two interesting optional parameters. The first is called inplace; if it is set to True, the DataFrame will be updated in place instead of a copy being returned. The second parameter is axis, which indicates the axis to drop along. By default this value is 0, indicating the row axis, but you can change it to 1 if you want to drop a column.

In [50]:
copy_df = df.copy()
copy_df.drop('Name', inplace=True, axis=1)
copy_df

|  | Class | Score |
|---|---|---|
| school1 | Physics | 85 |
| school2 | Chemistry | 82 |
| school1 | Biology | 90 |
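The drop function also accepts a `columns` keyword, which some readers find clearer than remembering the axis number. A short sketch on the same data, without `inplace`, so the original DataFrame is left alone:

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# Equivalent to df.drop('Name', axis=1); returns a new DataFrame
without_name = df.drop(columns=['Name'])

print(without_name.columns.tolist())
print(df.columns.tolist())   # the original still has all three columns
```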


Another way to drop a column is directly through the indexing operator, using the del keyword. This way of dropping data, however, takes immediate effect on the DataFrame and does not return a new object.

In [51]:
del copy_df['Class']

In [52]:
copy_df

|  | Score |
|---|---|
| school1 | 85 |
| school2 | 82 |
| school1 | 90 |


Adding a new column to the dataframe is as easy as assigning it to some value using the indexing operator. For instance, if we wanted to add a class ranking column with default values of None, we could do so by using the assignment operator after the square brackets. 

In [53]:
df['ClassRanking'] = None

In [54]:
df

|  | Name | Class | Score | ClassRanking |
|---|---|---|---|---|
| school1 | Alice | Physics | 85 | None |
| school2 | Jack | Chemistry | 82 | None |
| school1 | Helen | Biology | 90 | None |
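Assignment with the indexing operator is not limited to a single default value. As a sketch, a list with one entry per row, or a value computed from an existing column, can be assigned the same way. The rankings, the `HighScore` column name, and the threshold of 85 below are all hypothetical:

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# A scalar is repeated for every row
df['ClassRanking'] = None

# A list must supply one entry per row (hypothetical rankings)
df['ClassRanking'] = [2, 3, 1]

# A column can also be computed from an existing one (arbitrary threshold)
df['HighScore'] = df['Score'] >= 85

print(df[['ClassRanking', 'HighScore']])
```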


## DataFrame Indexing and Loading

In [56]:
import pandas as pd

In [57]:
df = pd.read_csv('Admission_Predict.csv')

In [58]:
df.head()

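|  | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |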


Notice that by default the index starts at 0 while the students' serial numbers start at 1. Pandas created this new index for us. We can instead set the index to the serial number when reading the file into pandas by passing index_col.

In [59]:
df = pd.read_csv('Admission_Predict.csv', index_col = 0)

In [60]:
df.head()

| Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|
| 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |
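The index column can also be chosen by name rather than by position. The sketch below reads from an in-memory string so that it runs without the CSV file. The three-column sample is a cut-down stand-in for the real file, not its actual contents:

```python
import io
import pandas as pd

# Cut-down stand-in for the file on disk (first two rows, three columns)
csv_text = """Serial No.,GRE Score,TOEFL Score
1,337,118
2,324,107
"""

# index_col accepts a column name as well as a column position
df = pd.read_csv(io.StringIO(csv_text), index_col='Serial No.')

print(df.index.name)
print(df.index.tolist())
```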


We can rename columns if we want, using the rename function with a dictionary that maps old names to new names.

In [63]:
new_df = df.rename(columns = {'GRE Score': 'GRE Score', 'TOEFL Score':'TOEFL Score', 
                             'University Rating':'University Rating', 'SOP':'Statement of Purpose',
                             'LOR ':'Letter of Recommendation', 'CGPA':'CGPA', 'Research':'Research',
                             'Chance of Admit ':'Chance of Admit'})
new_df.head()

| Serial No. | GRE Score | TOEFL Score | University Rating | Statement of Purpose | Letter of Recommendation | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|
| 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |


Note the trailing space in the keys 'LOR ' and 'Chance of Admit '. The column names in the file end with a space, so a key without it would not match, and rename would silently leave that column unchanged. We could manually add the space, but what if it was two spaces or a tab? We should use the strip function to clear the white space and use Python functions to apply it to all the column names.

We can pass in the strip function to the rename function as the mapper parameter, and indicate which axis to apply it to. 

In [64]:
new_df = new_df.rename(mapper=str.strip, axis='columns')

In [65]:
new_df.head()

| Serial No. | GRE Score | TOEFL Score | University Rating | Statement of Purpose | Letter of Recommendation | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|
| 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |
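To see the stripping take effect in isolation, here is a small sketch on a made-up one-row DataFrame whose column names carry stray whitespace. The second call uses the string accessor on the column index, which is one alternative way to get the same names:

```python
import pandas as pd

# Hypothetical frame: one name ends with a space, the other has a leading space and a tab
df = pd.DataFrame({'LOR ': [4.5], ' CGPA\t': [9.65]})

# str.strip is applied to every column label
stripped = df.rename(mapper=str.strip, axis='columns')
print(stripped.columns.tolist())

# The vectorized string accessor on the column index gives the same names
accessor_names = df.columns.str.strip().tolist()
print(accessor_names)
```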


We can also use the df.columns attribute by assigning to it a list of column names, which will directly rename the columns. This modifies the original DataFrame directly and is very efficient, especially when you have a lot of columns and you only want to change a few. This technique is also not affected by subtle errors in the column names, a problem that we just encountered. With a list, you can use the list index to change a certain value or use a list comprehension to change all the values.

In [66]:
# First get the list
cols = list(df.columns)
# Strip and lowercase the list elements
cols = [x.strip().lower() for x in cols]
# Then overwrite what is in the columns attribute
df.columns = cols
df.head()

| Serial No. | gre score | toefl score | university rating | sop | lor | cgpa | research | chance of admit |
|---|---|---|---|---|---|---|---|---|
| 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |


## Querying a DataFrame