## CIS 9
## Pandas, Data Analysis, Data Cleaning

Reading
<br>Python Data Science Handbook Chapter 3
- Introducing Pandas Objects
- Data Indexing and Selection
- Handling Missing Data, section on NaN
- Combining Datasets: Concat and Append, section on concat
- Aggregation and Grouping, section on groupby
- Vectorized String Operations, up to but not including the Example Recipe Database

Comparison of different data storage:
- A Python list can store different types of data and can change size, but that flexibility makes indexing and calculations on list data slow.
- A numpy array stores only one data type and has a fixed size, so indexing and calculations on a numpy array are very fast.
- A pandas data structure can store different types of data, so indexing carries a little more overhead than with numpy. Calculations on numeric data are done with numpy and are very fast.
<br>For data analysis purposes, this is the best of both worlds! We get some of the flexibility and all of the calculation speed.
<br><br>A pandas DataFrame (a 2D structure) is the workhorse of data analysis.
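
The comparison above can be sketched in a few lines. This is only an illustration with made-up values; the names `mixed`, `arr`, and `df` are assumptions and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# A Python list can mix types, but arithmetic needs an explicit loop.
mixed = [1, "two", 3.0]
doubled_list = [x * 2 for x in [1, 2, 3]]

# A numpy array holds a single dtype; mixing ints and a float upcasts to float.
arr = np.array([1, 2, 3.5])
doubled_arr = arr * 2            # vectorized, no Python loop

# A DataFrame keeps one dtype per column, so text and numbers can coexist.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1, 2, 3]})
doubled_col = df["score"] * 2    # numeric column math is delegated to numpy

print(arr.dtype)                 # float64
print(doubled_col.tolist())      # [2, 4, 6]
```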

Import libraries

In [81]:
import pandas as pd
import numpy as np   
# Pandas doesn't need importing of numpy, this import is for when we need numpy directly.

Pandas __Series__: 1D sequence of data

In [82]:
# 1. A Series is similar to a Python list, with data and indices
nums = pd.Series([1,5,2,8,3])
print(nums, '\n')
print(nums, "\n")
print(nums.values, '\n')
print(nums.index, '\n')
print(nums[0], '\n')
print(nums[:3])

0    1
1    5
2    2
3    8
4    3
dtype: int64 

0    1
1    5
2    2
3    8
4    3
dtype: int64 

[1 5 2 8 3] 

RangeIndex(start=0, stop=5, step=1) 

1 

0    1
1    5
2    2
dtype: int64


In [83]:
# 2. Internally, numeric data are stored in a numpy array
nums = pd.Series([0, -2.5, 8, -.7, 3])
print(type(nums[0]))
# and numpy operations can be used with numeric data
np.sum(nums)     

# Jupyter Notebook tip:
# In Jupyter Notebook, you don't have to use print() if the code in the cell produces 
# one output, because Jupyter Notebook automatically displays the value of the last expression.
# If the code produces multiple outputs, then a print() is needed for every output except 
# the last one, because Jupyter Notebook only displays the last one.

# For these exercise notebooks, I use print() for all output except the last one,
# or when we need to print text along with the output

<class 'numpy.float64'>


7.8

In [84]:
# 3. A Series is more flexible than a Python list because we can customize the indices.
# In this way, a Series behaves similarly to a Python dictionary
nums = pd.Series([99, 85, 72, 89], index=['quiz1', 'quiz2', 'quiz3', 'quiz4'])
print(nums, '\n')
print("Quiz 1:", nums['quiz1'])

# Pandas provides an easier way to type when indexing data:
print("Quiz 1:", nums.quiz1)

quiz1    99
quiz2    85
quiz3    72
quiz4    89
dtype: int64 

Quiz 1: 99
Quiz 1: 99
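
One caveat about the dot shorthand, sketched with hypothetical labels (`"quiz 2"` and `"mean"` are made up for this illustration): it only works when the label is a valid Python identifier and does not collide with an existing Series attribute or method. Bracket indexing works for any label.

```python
import pandas as pd

scores = pd.Series([99, 85, 72], index=["quiz1", "quiz 2", "mean"])

print(scores.quiz1)            # 99: the label is a valid Python identifier
print(scores["quiz 2"])        # 85: a label with a space needs brackets
print(scores["mean"])          # 72: brackets reach the data labeled "mean"
print(callable(scores.mean))   # True: dot access finds the Series.mean method instead
```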


In [85]:
# 4. In addition to creating a Series from a Python list, we can create a Series
# from a Python dictionary
d = {c:ord(c) for c in "ABCDE"}
letters = pd.Series(d)
print(letters.A, '\n')
letters['C':'E']

65 



C    67
D    68
E    69
dtype: int64

Pandas __DataFrame__: 2D table

In [86]:
# 5. A DataFrame is a 2D table with rows and columns, similar to a Python list of lists or
# a numpy 2D array
df = pd.DataFrame([ [90, 92], [73, 82],[79, 80], [97, 95] ])
df

    0   1
0  90  92
1  73  82
2  79  80
3  97  95


In [87]:
# 6. Just like with Series, we can customize the column indices.
# Each column of a DataFrame is a Series.
df = pd.DataFrame(columns=["quiz1", "quiz2"],
                  data=[ [90, 92], [73, 82],[79, 80], [97, 95] ])

print(df, '\n')   # note the difference between Python print vs. Jupyter Notebook print 
np.median(df.quiz2)

# Why do numpy operations work on a column of a DataFrame?
# because each column is a Series, and the values of a numeric Series are stored in a numpy array

   quiz1  quiz2
0     90     92
1     73     82
2     79     80
3     97     95 



87.0

In [88]:
# 7. Don't forget that an advantage of a DataFrame is that each column can have its own type 
# of data
df = pd.DataFrame(columns=["Names", "quiz1", "quiz2"],
                  data=[ ["Fred",90,92.5], ["Wilma",73,82],["Barney",79,80], ["Betty",90,95] ])
df

# Why does quiz2 contain floats while quiz1 contains ints?
# Each column is stored with a single data type. quiz2 has a float (92.5) among its
# initial values, so all of its values are stored as floats.
# quiz1 is initialized with ints only, so it stays an int column.

    Names  quiz1  quiz2
0    Fred     90   92.5
1   Wilma     73   82.0
2  Barney     79   80.0
3   Betty     90   95.0


__Accessing__ data

In [89]:
# 8. We've seen the . (dot) notation to index a column:
print(df.quiz1, '\n')
print("---")
# Use .columns with numeric column indices and []
print(df.columns[1:3])
print("---")
print(df[df.columns[1:3]], '\n')
# there is no df.rows
print("---")
# Use .loc for row and column indices:
print(df.loc[1:3], '\n')     # Note the inclusive ending for .loc
print("---")
print(df.loc[:,['quiz1','quiz2']],'\n')
print(df.loc[2,['quiz1']],'\n')

# When accessing a single element, it's faster to use .at:
print(df.at[2,'quiz1'],'\n')

0    90
1    73
2    79
3    90
Name: quiz1, dtype: int64 

---
Index(['quiz1', 'quiz2'], dtype='object')
---
   quiz1  quiz2
0     90   92.5
1     73   82.0
2     79   80.0
3     90   95.0 

---
    Names  quiz1  quiz2
1   Wilma     73   82.0
2  Barney     79   80.0
3   Betty     90   95.0 

---
   quiz1  quiz2
0     90   92.5
1     73   82.0
2     79   80.0
3     90   95.0 

quiz1    79
Name: 2, dtype: object 

79 
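
The cell above uses label-based access (`.loc`, `.at`). As a hedged companion sketch, the position-based counterparts `.iloc` and `.iat` follow Python's usual slicing rule, where the end is excluded. The DataFrame is rebuilt here so the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame(columns=["Names", "quiz1", "quiz2"],
                  data=[["Fred", 90, 92.5], ["Wilma", 73, 82],
                        ["Barney", 79, 80], ["Betty", 90, 95]])

rows_by_label = df.loc[1:3]        # labels 1, 2, 3 (end included)
rows_by_position = df.iloc[1:3]    # positions 1 and 2 (end excluded)
cell = df.iat[2, 1]                # row position 2, column position 1

print(len(rows_by_label), len(rows_by_position), cell)   # 3 2 79
```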



In [90]:
# 9. To access data with a specific value or range of values in a column:
print(df[df.quiz1 == 90], '\n')
print(df[df.quiz2 < 90],'\n')
print(df[df.Names == "Betty"],'\n')

# Write 1 print statement to print the names of students with quiz2 score greater than 90
print(df[df.quiz2 > 90].Names)

   Names  quiz1  quiz2
0   Fred     90   92.5
3  Betty     90   95.0 

    Names  quiz1  quiz2
1   Wilma     73   82.0
2  Barney     79   80.0 

   Names  quiz1  quiz2
3  Betty     90   95.0 

0     Fred
3    Betty
Name: Names, dtype: object
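
Conditions like the ones above can also be combined. A minimal sketch, assuming the same four-student DataFrame (rebuilt here so it runs on its own); the thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(columns=["Names", "quiz1", "quiz2"],
                  data=[["Fred", 90, 92.5], ["Wilma", 73, 82],
                        ["Barney", 79, 80], ["Betty", 90, 95]])

# Each comparison yields a boolean Series; combine them with & (and), | (or), ~ (not).
# The parentheses are required because & and | bind tighter than comparisons.
both_high = df[(df.quiz1 >= 90) & (df.quiz2 > 93)]
either_low = df[(df.quiz1 < 75) | (df.quiz2 < 81)]

# .loc accepts a boolean mask for the rows and a label for the column.
names = df.loc[df.quiz2.between(80, 90), "Names"]

print(both_high.Names.tolist())    # ['Betty']
print(either_low.Names.tolist())   # ['Wilma', 'Barney']
print(names.tolist())              # ['Wilma', 'Barney']
```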


__Reading__ from files

In [91]:
# 10. read_csv always returns a DataFrame, even when the file has a single column of data.
# (Call .squeeze("columns") on the result to get a Series.)
quiz1 = pd.read_csv("quiz_scores.csv")
print(quiz1, '\n')

# Note: for data analysis, all scores in the examples in this notebook are out of 50.

# If the file is a csv file with rows and columns, it will be read into a DataFrame
gradebook = pd.read_csv("scores.csv")
print(gradebook, '\n')

gradebook = pd.read_csv("scores.csv", index_col='Student')
print(gradebook, '\n')

gradebook = pd.read_csv("scores.csv", header=0, names=["name","q1","midt","q2","final"])
print(gradebook,'\n')

gradebook = pd.read_csv("scores.csv", header=0, names=["q1","midt","q2","final"])
print(gradebook)

   quiz1
0     43
1     33
2     48
3     40
4     46
5     48
6     38
7     41 

      Student  quiz1  midterm  quiz2  final
0      Sleepy     43     34.0     34     35
1       Happy     33     18.0     23     50
2         Doc     48     42.0     36     37
3      Grumpy     40     23.5     40     45
4     Bashful     46     42.5     46     41
5      Sneezy     48     39.5     48     43
6       Dopey     38     45.0     39     32
7  Snow White     41     44.0     39     41 

            quiz1  midterm  quiz2  final
Student                                 
Sleepy         43     34.0     34     35
Happy          33     18.0     23     50
Doc            48     42.0     36     37
Grumpy         40     23.5     40     45
Bashful        46     42.5     46     41
Sneezy         48     39.5     48     43
Dopey          38     45.0     39     32
Snow White     41     44.0     39     41 

         name  q1  midt  q2  final
0      Sleepy  43  34.0  34     35
1       Happy  33  18.0  23     50
2 
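
The `read_csv` options used above can be tried without any files on disk. This sketch assumes a tiny hypothetical CSV held in a string (`csv_text` is made up for illustration) and reads it through `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical file contents, inlined so the example needs no files.
csv_text = "Student,quiz1,midterm\nSleepy,43,34.0\nHappy,33,18.0\n"

default = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col="Student")
# header=0 says the first line holds the old labels; names replaces them.
renamed = pd.read_csv(io.StringIO(csv_text), header=0, names=["name", "q1", "midt"])

# A single-column file still comes back as a DataFrame; squeeze turns it into a Series.
one_col = pd.read_csv(io.StringIO("quiz1\n43\n33\n"))
as_series = one_col.squeeze("columns")

print(default.columns.tolist())    # ['Student', 'quiz1', 'midterm']
print(indexed.index.tolist())      # ['Sleepy', 'Happy']
print(renamed.columns.tolist())    # ['name', 'q1', 'midt']
print(type(one_col).__name__, type(as_series).__name__)   # DataFrame Series
```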

In [92]:
# 11. We can also read from Excel files (among many other common types: HTML, JSON, SQL, etc.)
gradebook = pd.read_excel("scores.xlsx", index_col='Student')
gradebook

            quiz1  midterm  quiz2  final
Student                                 
Sleepy         43     34.0     34     35
Happy          33     20.0     23     49
Doc            48     32.0     36     37
Grumpy         40     23.5     40     45
Bashful        46     42.5     46     31
Sneezy         48     38.5     48     43
Dopey          38     45.0     39     32
Snow White     41     48.0     39     41


In [93]:
# 12. From the gradebook in the cell above:
# print the data for the student named Doc?
print(gradebook.loc['Doc'], '\n')
# print Dopey's quiz1 and quiz2?
print(gradebook.loc['Dopey',['quiz1', 'quiz2']])

quiz1      48.0
midterm    32.0
quiz2      36.0
final      37.0
Name: Doc, dtype: float64 

quiz1    38.0
quiz2    39.0
Name: Dopey, dtype: float64


Show __attributes__

In [94]:
# 13. Inspect the labels, shape, and first/last rows of a DataFrame
print(gradebook.index)          # row indices or labels
print(gradebook.index.values)
print(gradebook.columns)        # column indices or labels
print(gradebook.columns.values, '\n')
print(gradebook.shape)
print(len(gradebook), '\n')
print(gradebook.head(), '\n')
gradebook.tail(3)

# what do head() and tail() do?
# head(): returns the first 5 rows by default
# tail(): returns the last 5 rows by default; tail(3) returns the last 3

Index(['Sleepy', 'Happy', 'Doc', 'Grumpy', 'Bashful', 'Sneezy', 'Dopey',
       'Snow White'],
      dtype='object', name='Student')
['Sleepy' 'Happy' 'Doc' 'Grumpy' 'Bashful' 'Sneezy' 'Dopey' 'Snow White']
Index(['quiz1', 'midterm', 'quiz2', 'final'], dtype='object')
['quiz1' 'midterm' 'quiz2' 'final'] 

(8, 4)
8 

         quiz1  midterm  quiz2  final
Student                              
Sleepy      43     34.0     34     35
Happy       33     20.0     23     49
Doc         48     32.0     36     37
Grumpy      40     23.5     40     45
Bashful     46     42.5     46     31 



            quiz1  midterm  quiz2  final
Student                                 
Sneezy         48     38.5     48     43
Dopey          38     45.0     39     32
Snow White     41     48.0     39     41


### Data Analysis

Basic __statistics__

In [109]:
# 14. We can get all the basic stats in one method
gradebook.describe()

# Review statistics and data analysis:
# You are the teacher for this class, and as a good teacher, you want to improve your 
# class material.
# Run the cell to see the statistics for the exams: quiz1, midterm, quiz2, final.
# Based on the statistics, which exam's class material needs the most improvement?
# Explain your choice by citing specific statistic values and how they show the 
# need to improve.
# Answer:
# Improve the midterm material
# - lowest mean and lowest 50th percentile (median), which means the scores are 
#   generally lower than on the other exams
# - largest standard deviation, which means the students' grasp of the material 
#   is uneven
# Note: the output below has count 9 because this cell was run after the "New Kid"
# row was appended in step 22.

           quiz1    midterm      quiz2      final
count        9.0        9.0        9.0        9.0
mean   40.777778  34.833333  37.222222  38.111111
std     6.339909   9.549215   7.693793   6.735066
min         30.0       20.0       23.0       30.0
25%         38.0       30.0       34.0       32.0
50%         41.0       34.0       39.0       37.0
75%         46.0       42.5       40.0       43.0
max         48.0       48.0       48.0       49.0


In [96]:
# 15. To get a specific statistic for a specific column, we use numpy
print(np.median(gradebook.quiz1))
print(np.mean(gradebook.quiz2), '\n')
# or pandas
print(gradebook.quiz2.mean(), '\n')

# Can also get all statistics of one column
gradebook.quiz2.describe()

42.0
38.125 

38.125 



count     8.000000
mean     38.125000
std       7.698562
min      23.000000
25%      35.500000
50%      39.000000
75%      41.500000
max      48.000000
Name: quiz2, dtype: float64
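
When only a few statistics are needed for every column, `agg` with a list of function names is one option. A sketch, assuming the eight-student Excel gradebook values shown earlier (inlined here so the example is self-contained):

```python
import pandas as pd

gradebook = pd.DataFrame(
    {"quiz1": [43, 33, 48, 40, 46, 48, 38, 41],
     "midterm": [34, 20, 32, 23.5, 42.5, 38.5, 45, 48],
     "quiz2": [34, 23, 36, 40, 46, 48, 39, 39],
     "final": [35, 49, 37, 45, 31, 43, 32, 41]},
    index=pd.Index(["Sleepy", "Happy", "Doc", "Grumpy", "Bashful",
                    "Sneezy", "Dopey", "Snow White"], name="Student"))

# One row per requested statistic, one column per exam.
summary = gradebook.agg(["mean", "median", "std"])
print(summary.round(2))

# idxmin returns the label of the smallest value: the exam with the lowest mean.
print(summary.loc["mean"].idxmin())   # midterm
```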

In [97]:
# 16. Show a sample of students who earned more than 90% and more than 80% in their final
print(gradebook[gradebook.final > 50*.9], '\n')
print(gradebook[gradebook.final > 50*.8], '\n')

# Show the number of students who earned more than 90% and more than 80% in their final?
print(len(gradebook[gradebook.final > 50*.9]))
print(len(gradebook[gradebook.final > 50*.8]))

         quiz1  midterm  quiz2  final
Student                              
Happy       33     20.0     23     49 

            quiz1  midterm  quiz2  final
Student                                 
Happy          33     20.0     23     49
Grumpy         40     23.5     40     45
Sneezy         48     38.5     48     43
Snow White     41     48.0     39     41 

1
4
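
Counting can also be done on the boolean mask itself, since `True` counts as 1. A small sketch, assuming the final scores from the Excel gradebook above (inlined as a Series):

```python
import pandas as pd

final = pd.Series([35, 49, 37, 45, 31, 43, 32, 41], name="final")

above_80 = final > 50 * .8        # boolean mask, scores are out of 50

print(int(above_80.sum()))        # 4: number of students above 80%
print(float(above_80.mean()))     # 0.5: fraction of students above 80%
```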


Basic __Calculations__

In [98]:
# 17. Assume the midterm and final are each worth 30% of the grade, and quiz1 and quiz2 
# are each worth 20% of the grade. 
# (This means 60% of the grade comes from the midterm and final, and 40% of the grade 
# comes from the quizzes)
# We want to calculate the weighted average of the exams,
# and we want the score to be out of 100 to make it easier to see the percentage.
wtAvg=(.2 * gradebook.quiz1 + .2 * gradebook.quiz2 + 
       .3 * gradebook.midterm + .3 * gradebook.final)
print(wtAvg)

# Show the wtAvg as a percentage?
# Recall that the raw scores are out of 50.
wtAvg / 50 * 100

Student
Sleepy        36.10
Happy         31.90
Doc           37.50
Grumpy        36.55
Bashful       40.45
Sneezy        43.65
Dopey         38.50
Snow White    42.70
dtype: float64


Student
Sleepy        72.2
Happy         63.8
Doc           75.0
Grumpy        73.1
Bashful       80.9
Sneezy        87.3
Dopey         77.0
Snow White    85.4
dtype: float64
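
The same weighted average can be written with the weights stored in a Series, which keeps them in one place. A sketch under the assumption of three students from the gradebook above; the names `weights` and `percent` are illustrative only.

```python
import pandas as pd

gradebook = pd.DataFrame(
    {"quiz1": [43, 33, 48], "midterm": [34.0, 20.0, 38.5],
     "quiz2": [34, 23, 48], "final": [35, 49, 43]},
    index=["Sleepy", "Happy", "Sneezy"])

# Weights are matched to the columns by label, not by position.
weights = pd.Series({"quiz1": .2, "midterm": .3, "quiz2": .2, "final": .3})
wt_avg = gradebook.mul(weights, axis="columns").sum(axis=1)

# Raw scores are out of 50; store the percentage as a new column.
gradebook["percent"] = wt_avg / 50 * 100
print(gradebook["percent"].round(1).tolist())   # [72.2, 63.8, 87.3]
```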

In [99]:
# 18. Sort by a column
print(gradebook, '\n')
print(gradebook.sort_values(by="quiz1"), '\n')
print(gradebook.sort_values(by="quiz1", ascending=False), '\n')

            quiz1  midterm  quiz2  final
Student                                 
Sleepy         43     34.0     34     35
Happy          33     20.0     23     49
Doc            48     32.0     36     37
Grumpy         40     23.5     40     45
Bashful        46     42.5     46     31
Sneezy         48     38.5     48     43
Dopey          38     45.0     39     32
Snow White     41     48.0     39     41 

            quiz1  midterm  quiz2  final
Student                                 
Happy          33     20.0     23     49
Dopey          38     45.0     39     32
Grumpy         40     23.5     40     45
Snow White     41     48.0     39     41
Sleepy         43     34.0     34     35
Bashful        46     42.5     46     31
Doc            48     32.0     36     37
Sneezy         48     38.5     48     43 

            quiz1  midterm  quiz2  final
Student                                 
Doc            48     32.0     36     37
Sneezy         48     38.5     48     43
Bashful     
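
Doc and Sneezy tie on quiz1, so a second sort key can break the tie. A sketch using three rows of the gradebook above (inlined); `sort_index` is included to show sorting by row label instead of by value.

```python
import pandas as pd

gb = pd.DataFrame({"quiz1": [48, 48, 46], "final": [37, 43, 31]},
                  index=["Doc", "Sneezy", "Bashful"])

# Sort by quiz1 first; rows with equal quiz1 are then ordered by final.
ranked = gb.sort_values(by=["quiz1", "final"], ascending=[False, False])
print(ranked.index.tolist())     # ['Sneezy', 'Doc', 'Bashful']

# Sort by the row labels instead of by a column.
by_name = gb.sort_index()
print(by_name.index.tolist())    # ['Bashful', 'Doc', 'Sneezy']
```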

Changing shape

In [100]:
# 19. Remove rows
gradebook = pd.read_excel("scores.xlsx", index_col='Student')
print(gradebook,'\n')
print(gradebook.drop(["Sneezy","Happy"]),'\n')
print(gradebook,'\n')

gradebook.drop(["Sneezy","Happy"], inplace=True)
print(gradebook,'\n')

            quiz1  midterm  quiz2  final
Student                                 
Sleepy         43     34.0     34     35
Happy          33     20.0     23     49
Doc            48     32.0     36     37
Grumpy         40     23.5     40     45
Bashful        46     42.5     46     31
Sneezy         48     38.5     48     43
Dopey          38     45.0     39     32
Snow White     41     48.0     39     41 

            quiz1  midterm  quiz2  final
Student                                 
Sleepy         43     34.0     34     35
Doc            48     32.0     36     37
Grumpy         40     23.5     40     45
Bashful        46     42.5     46     31
Dopey          38     45.0     39     32
Snow White     41     48.0     39     41 

            quiz1  midterm  quiz2  final
Student                                 
Sleepy         43     34.0     34     35
Happy          33     20.0     23     49
Doc            48     32.0     36     37
Grumpy         40     23.5     40     45
Bashful     

In [101]:
# 20. Remove columns
gradebook.drop(columns=['quiz2'], inplace=True)
gradebook

            quiz1  midterm  final
Student                          
Sleepy         43     34.0     35
Doc            48     32.0     37
Grumpy         40     23.5     45
Bashful        46     42.5     31
Dopey          38     45.0     32
Snow White     41     48.0     41
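
Without `inplace=True`, `drop` returns a new DataFrame and leaves the original untouched. A short sketch on a made-up two-row gradebook; it also shows the `errors="ignore"` option, which skips labels that are not present instead of raising a `KeyError`.

```python
import pandas as pd

gb = pd.DataFrame({"quiz1": [43, 33], "quiz2": [34, 23]},
                  index=["Sleepy", "Happy"])

# Rows and columns can be dropped in one call; gb itself is not modified.
trimmed = gb.drop(index=["Happy"], columns=["quiz2"])
print(gb.shape, trimmed.shape)   # (2, 2) (1, 1)

# "Grumpy" is not in the index; errors="ignore" returns the data unchanged.
same = gb.drop(index=["Grumpy"], errors="ignore")
print(same.shape)                # (2, 2)
```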


In [102]:
# 21. Adding from another DataFrame
gradebook = pd.read_excel("scores.xlsx", index_col='Student')
stInfo = pd.read_excel("ids.xlsx", index_col='Student')
data = pd.concat([stInfo, gradebook], axis=1)
print(data)
#data = pd.concat([stInfo, gradebook])  # axis=0
#data

              id  year  quiz1  midterm  quiz2  final
Student                                             
Sleepy      1124     1     43     34.0     34     35
Happy       9559     2     33     20.0     23     49
Doc         7503     2     48     32.0     36     37
Grumpy      5738     3     40     23.5     40     45
Bashful     9208     2     46     42.5     46     31
Sneezy      2823     1     48     38.5     48     43
Dopey       4823     1     38     45.0     39     32
Snow White  9794     2     41     48.0     39     41
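
The difference between `axis=1` and `axis=0` can be seen on two tiny hypothetical DataFrames (`info` and `scores` are made up, with rows deliberately in a different order): `axis=1` places them side by side and matches rows by index label, while `axis=0` stacks them and fills the columns that do not match with NaN.

```python
import pandas as pd

info = pd.DataFrame({"year": [1, 2]}, index=["Sleepy", "Happy"])
scores = pd.DataFrame({"quiz1": [33, 43]}, index=["Happy", "Sleepy"])

# axis=1: columns are added; rows are aligned by label, not by position.
side_by_side = pd.concat([info, scores], axis=1)
print(side_by_side.loc["Sleepy", "quiz1"])   # 43

# axis=0 (the default): rows are added; missing cells become NaN.
stacked = pd.concat([info, scores])
print(stacked.shape)                         # (4, 2)
print(int(stacked.isna().sum().sum()))       # 4
```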


In [103]:
# 22. Append data 
# append another DataFrame
gradebook = pd.read_excel("scores.xlsx")
newrow = pd.DataFrame(columns=['Student','quiz1','midterm','quiz2','final'],
                      data=[ ["New Kid",30,30,30,30] ])
print(newrow)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
gradebook = pd.concat([gradebook, newrow], ignore_index=True)
gradebook

   Student  quiz1  midterm  quiz2  final
0  New Kid     30       30     30     30


      Student  quiz1  midterm  quiz2  final
0      Sleepy     43     34.0     34     35
1       Happy     33     20.0     23     49
2         Doc     48     32.0     36     37
3      Grumpy     40     23.5     40     45
4     Bashful     46     42.5     46     31
5      Sneezy     48     38.5     48     43
6       Dopey     38     45.0     39     32
7  Snow White     41     48.0     39     41
8     New Kid     30     30.0     30     30


In [104]:
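# append a dictionary: wrap it in a one-row DataFrame first
gradebook = pd.read_excel("scores.xlsx")
d = dict(zip(['Student','quiz1','midterm','quiz2','final'],["New Kid",30,30,30,30]))
print(d)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
gradebook = pd.concat([gradebook, pd.DataFrame([d])], ignore_index=True)
gradebook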

{'Student': 'New Kid', 'quiz1': 30, 'midterm': 30, 'quiz2': 30, 'final': 30}


      Student  quiz1  midterm  quiz2  final
0      Sleepy     43     34.0     34     35
1       Happy     33     20.0     23     49
2         Doc     48     32.0     36     37
3      Grumpy     40     23.5     40     45
4     Bashful     46     42.5     46     31
5      Sneezy     48     38.5     48     43
6       Dopey     38     45.0     39     32
7  Snow White     41     48.0     39     41
8     New Kid     30     30.0     30     30


__groupby__ for data aggregation

In [105]:
# 23. Group the rows by year and compute the mean of each group
print(data.groupby("year").mean(), '\n')

# The above output shows the mean of the id's, which doesn't make sense.
# Show the mean of the exams only?
print(data[data.columns[1:]].groupby("year").mean())

               id  quiz1    midterm      quiz2      final
year                                                     
1     2923.333333   43.0  39.166667  40.333333  36.666667
2     9016.000000   42.0  35.625000  36.000000  39.500000
3     5738.000000   40.0  23.500000  40.000000  45.000000 

      quiz1    midterm      quiz2      final
year                                        
1      43.0  39.166667  40.333333  36.666667
2      42.0  35.625000  36.000000  39.500000
3      40.0  23.500000  40.000000  45.000000
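
Selecting the exam columns by name after `groupby` avoids depending on column positions. A sketch, assuming four students from the combined table above (inlined, with only two exam columns to keep it short):

```python
import pandas as pd

data = pd.DataFrame({"id": [1124, 9559, 7503, 5738],
                     "year": [1, 2, 2, 3],
                     "quiz1": [43, 33, 48, 40],
                     "final": [35, 49, 37, 45]},
                    index=["Sleepy", "Happy", "Doc", "Grumpy"])

# Mean of the chosen columns only; the id column is left out.
exam_means = data.groupby("year")[["quiz1", "final"]].mean()
print(exam_means)

# Several statistics for one column, one row per year.
final_summary = data.groupby("year")["final"].agg(["count", "mean", "max"])
print(final_summary)
```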


### Data Cleaning

Missing data or __NaN__

In [106]:
# 24. When data is read into a DataFrame and some values are missing, the missing values 
# appear as NaN in the DataFrame. NaN (Not a Number) is a special floating-point value 
# defined by the IEEE 754 standard.
data = pd.read_csv("classes.csv")   # empty field in CSV file
print(data, '\n')

# remove data records (rows) with NaN
cleanedData = data.dropna()
print(cleanedData, '\n')

# replace NaN with some default value
subbedData = data.fillna(0)
print(subbedData, '\n')

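# check for NaN in the DataFrame
data[data.isna().any(axis=1)]   # rows with at least one NaN (not displayed: it is not the last expression)
data.isna().sum()               # number of NaN values in each column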

    Class Days     Time  Number of Units  Number of Students Location
0    CIS3   MW   9:30am              4.0                45.0  De Anza
1  CIS22A   MW  11:30am              4.5                 NaN  De Anza
2  CIS41A  TTH   9:30am              4.5                47.0  De Anza
3  CIS18B   MW   1:30pm              4.5                 NaN  De Anza 

    Class Days    Time  Number of Units  Number of Students Location
0    CIS3   MW  9:30am              4.0                45.0  De Anza
2  CIS41A  TTH  9:30am              4.5                47.0  De Anza 

    Class Days     Time  Number of Units  Number of Students Location
0    CIS3   MW   9:30am              4.0                45.0  De Anza
1  CIS22A   MW  11:30am              4.5                 0.0  De Anza
2  CIS41A  TTH   9:30am              4.5                47.0  De Anza
3  CIS18B   MW   1:30pm              4.5                 0.0  De Anza 



Class                 0
Days                  0
Time                  0
Number of Units       0
Number of Students    2
Location              0
dtype: int64
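
Replacing NaN with 0 changes statistics such as the mean. One alternative, sketched here as an assumption rather than a recommendation for every data set, is to fill with the column mean or to drop only the rows where a specific column is missing. The two-column DataFrame is a cut-down copy of the class data above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"Class": ["CIS3", "CIS22A", "CIS41A", "CIS18B"],
                     "Number of Students": [45.0, np.nan, 47.0, np.nan]})
col = "Number of Students"

# A dict limits the fill to the named column; mean() skips NaN by default.
filled = data.fillna({col: data[col].mean()})
print(filled[col].tolist())      # [45.0, 46.0, 47.0, 46.0]

# subset restricts the NaN check to the listed columns.
kept = data.dropna(subset=[col])
print(kept["Class"].tolist())    # ['CIS3', 'CIS41A']
```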

In [107]:
# 25. NaN with numpy
print(np.median(data['Number of Students']))
print(np.median(cleanedData['Number of Students']), '\n')

# NaN with pandas
print(data['Number of Students'].median())
data['Number of Students'].describe()

nan
46.0 

46.0


count     2.000000
mean     46.000000
std       1.414214
min      45.000000
25%      45.500000
50%      46.000000
75%      46.500000
max      47.000000
Name: Number of Students, dtype: float64
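
The outputs above differ because numpy's `median` propagates NaN while the pandas method skips it by default. A sketch of both behaviors, with numpy's NaN-aware variant `np.nanmedian` added for comparison; the Series holds the same four values as the column above.

```python
import numpy as np
import pandas as pd

students = pd.Series([45.0, np.nan, 47.0, np.nan])

print(np.median(students.to_numpy()))      # nan: NaN propagates
print(np.nanmedian(students.to_numpy()))   # 46.0: NaN is ignored
print(students.median())                   # 46.0: pandas skips NaN by default
print(students.median(skipna=False))       # nan: opt out of skipping
```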

Change column labels: __string vectorization__

In [108]:
# 26. As seen from the cell above, it's more convenient to have a shorter column label.
# Simplify the data.columns (column labels) so it's easier to type.
# a. change the column labels so they're all lowercase
print(data)
data.columns = data.columns.str.lower()
print(data.columns)
# b. change column labels to 1 word: class, days, time, units, students ?
# You'll need to do the reading for this answer
data.columns = data.columns.str.extract('([a-z]+)$',expand=False)
print(data.columns)
data

    Class Days     Time  Number of Units  Number of Students Location
0    CIS3   MW   9:30am              4.0                45.0  De Anza
1  CIS22A   MW  11:30am              4.5                 NaN  De Anza
2  CIS41A  TTH   9:30am              4.5                47.0  De Anza
3  CIS18B   MW   1:30pm              4.5                 NaN  De Anza
Index(['class', 'days', 'time', 'number of units', 'number of students',
       'location'],
      dtype='object')
Index(['class', 'days', 'time', 'units', 'students', 'location'], dtype='object')


    class days     time  units  students location
0    CIS3   MW   9:30am    4.0      45.0  De Anza
1  CIS22A   MW  11:30am    4.5       NaN  De Anza
2  CIS41A  TTH   9:30am    4.5      47.0  De Anza
3  CIS18B   MW   1:30pm    4.5       NaN  De Anza
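
The regular expression above keeps the last word of each label. When the new labels do not follow a pattern, an explicit mapping with `rename` is another option. A sketch with a one-row, three-column cut-down of the class data (inlined for illustration):

```python
import pandas as pd

data = pd.DataFrame(columns=["Class", "Number of Units", "Number of Students"],
                    data=[["CIS3", 4.0, 45.0]])

# Vectorized string methods on the column labels: lowercase, then keep the last word.
last_word = data.columns.str.lower().str.extract(r"([a-z]+)$", expand=False)
print(last_word.tolist())         # ['class', 'units', 'students']

# rename returns a new DataFrame; labels missing from the dict are left unchanged.
renamed = data.rename(columns={"Number of Units": "units",
                               "Number of Students": "students"})
print(renamed.columns.tolist())   # ['Class', 'units', 'students']
```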


Remove unnecessary columns

In [84]:
# 27. the location is always De Anza, which doesn't give us any info.
# remove location column ?
data = data.drop(columns=['location'])
data

    class days     time  units  students
0    CIS3   MW   9:30am    4.0      45.0
1  CIS22A   MW  11:30am    4.5       NaN
2  CIS41A  TTH   9:30am    4.5      47.0
3  CIS18B   MW   1:30pm    4.5       NaN


In [10]:
# 28. Convert to numpy array
gradebook = pd.read_csv("scores.csv", header=0, names=["q1","midt","q2","final"])
print(gradebook, '\n')
arr = gradebook.values
print(type(arr))
print(arr)

            q1  midt  q2  final
Sleepy      43  34.0  34     35
Happy       33  18.0  23     50
Doc         48  42.0  36     37
Grumpy      40  23.5  40     45
Bashful     46  42.5  46     41
Sneezy      48  39.5  48     43
Dopey       38  45.0  39     32
Snow White  41  44.0  39     41 

<class 'numpy.ndarray'>
[[43.  34.  34.  35. ]
 [33.  18.  23.  50. ]
 [48.  42.  36.  37. ]
 [40.  23.5 40.  45. ]
 [46.  42.5 46.  41. ]
 [48.  39.5 48.  43. ]
 [38.  45.  39.  32. ]
 [41.  44.  39.  41. ]]
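
The array above is all floats because a numpy array has a single dtype, so the int columns are upcast to match the float column. A sketch of that rule using `to_numpy()`, the method the pandas documentation recommends over `.values`; the two-row data is a made-up subset.

```python
import numpy as np
import pandas as pd

gb = pd.DataFrame({"q1": [43, 33], "midt": [34.0, 18.0]})

arr = gb.to_numpy()
print(arr.dtype)          # float64: ints and floats share a float array

# Adding a text column forces the most general dtype, object.
with_names = gb.assign(name=["Sleepy", "Happy"]).to_numpy()
print(with_names.dtype)   # object
```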


In [13]:
# 29. Get a random sample
sample = gradebook.sample(n=3)
sample

        q1  midt  q2  final
Doc     48  42.0  36     37
Grumpy  40  23.5  40     45
Dopey   38  45.0  39     32
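
Each call to `sample` picks different rows. When a repeatable sample is needed, passing `random_state` fixes the selection. A sketch on one column of the gradebook (inlined); the seed value 42 is arbitrary.

```python
import pandas as pd

gb = pd.DataFrame({"q1": [43, 33, 48, 40, 46, 48, 38, 41]},
                  index=["Sleepy", "Happy", "Doc", "Grumpy", "Bashful",
                         "Sneezy", "Dopey", "Snow White"])

# The same seed gives the same rows every time.
first = gb.sample(n=3, random_state=42)
again = gb.sample(n=3, random_state=42)
print(first.equals(again))   # True

# frac samples a fraction of the rows instead of a fixed count.
half = gb.sample(frac=0.5, random_state=42)
print(len(half))             # 4
```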
