# Module 1.2 - Explore Data

In this lesson, you will learn how to do exploratory data analysis (EDA). In the EDA phase, you need to understand what information is in your dataset in order to later clean and prepare the data for predictive analysis.

## Python libraries

Libraries provide pre-written code for functions that are not included in basic Python. Once a library has been imported, any of its functions can be used throughout the entire notebook. Some libraries have long names that are easy to misspell when calling their functions, so it is common practice to assign a library a short alias using `as`.

In [2]:
import numpy as np  # numpy used for linear algebra functions
import pandas as pd # pandas used to explore and manipulate dataframes

## pandas and numpy
- `pandas`: Tools for reading and writing data between in-memory data structures and different formats such as CSV and text files, Microsoft Excel, and SQL databases. It takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a dataframe.

- `numpy`: An open source numerical Python library. It can be used to perform a number of mathematical operations on arrays, such as trigonometric, statistical, and algebraic routines.
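
As a small illustration of the kinds of array operations mentioned above, the sketch below applies one trigonometric, one statistical, and one algebraic operation to a made-up array (the variable names are only for this example).

```python
import numpy as np

# made-up angles in radians, used only for illustration
values = np.array([0.0, np.pi / 2, np.pi])

sines = np.sin(values)     # trigonometric routine, applied to every element
average = values.mean()    # statistical routine
doubled = values * 2       # algebraic operation on the whole array at once

print(sines)
print(average)
print(doubled)
```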

## Load data

This notebook is a file in which we will explore the data, but before that can be done, we must first read in a copy of the data from the file where it is stored. The `pandas` library will read the information in a file and then format it into a row-column structure called a ***dataframe***. Dataframes are similar to a table or spreadsheet and can be used within Python to explore and manipulate items in the rows or columns.

### CSV files

CSV stands for "comma separated values". It is a plain text file where each value in a line is separated by some delimiter (usually commas but can be tabs, semicolons, spaces, etc.). In `pandas` the function to read data from a CSV file is `.read_csv(name of file)`. The path to the file with data should be typed inside the parentheses (as a string) of the `.read_csv()` function. If the data file and the notebook file are located in the same folder, then just the name of the file is required. The `.read_csv()` function assumes that the first line of information in the file will be the column header names for the dataset. 
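
When the delimiter is not a comma, the `sep` argument of `.read_csv()` can name it. The sketch below is only an illustration: it reads a made-up, semicolon-delimited string through `io.StringIO` so that it runs without a file on disk.

```python
import io
import pandas as pd

# hypothetical semicolon-delimited text standing in for a file on disk
raw = io.StringIO("Name;Grade\nMarcia;82.4\nKadeem;78.2\n")

# sep tells read_csv which character separates the values
df_semicolon = pd.read_csv(raw, sep=";")

print(df_semicolon)
```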

In [2]:
#load a csv file named "smallgradesh.csv" located in the datasets folder

location = "datasets/smallgradesh.csv"
df = pd.read_csv(location)

### Dataframes

Unlike spreadsheets in tools such as Microsoft Excel, dataframes are not meant to be viewed in full detail (meaning, scroll through every single row of information). Instead, to verify that the data was loaded in properly, the `.head()` and `.tail()` functions show several rows and columns within the dataset. By default, the `.head()` and `.tail()` functions show the first/last five rows of the dataset. The number of rows shown can be altered by setting the number inside the function parentheses. However, if the number of rows requested is excessive, then the dataframe will only show several rows from the top of the dataframe, skip the middle section, and then show the last few rows. Additional features of the data can be verified using other exploratory functions to determine if any data formatting needs to be done.

In [3]:
# this will show first 5 rows in the dataframe
df.head()

Unnamed: 0,Name,Grade
0,Marcia,82.4
1,Kadeem,78.2
2,Nash,79.3
3,Noelani,83.2
4,Noelani,87.4


*NOTE*: In Jupyter notebook, whenever you do not remember what a function does or need to know what parameters/arguments can be used in it, just type the function name with a question mark at the end (no parentheses) and run the cell.

In [4]:
#df.head?

In [5]:
# this will show the last 3 rows of the dataframe
df.tail(3)

Unnamed: 0,Name,Grade
1997,Mercedes,84.9
1998,Lucius,69.1
1999,Linus,79.6


### Data without headers

If the file (CSV, Excel, etc.) does not contain column header names, then the argument `header=None` is used to read the first line in the file as the first row in the dataframe. In addition, you can add column header names while reading in the data using the `names=[list of column names]` argument in the function. Column headers can also be added after the data has been loaded as a dataframe: first, type the dataframe variable name followed by the `.columns` attribute (this accesses the names of the columns). Then, on the right side of an equal sign (assignment operator), type a Python list of all of the new header names in the position order of the columns (this also works for renaming columns).

In [6]:
location2 = "datasets/smallgrades.csv"

# CSV file that does not have column header names
# Python creates index position placeholder names for the columns

df_nohead = pd.read_csv(location2, header=None) #try w/o header=None

df_nohead.head()

Unnamed: 0,0,1
0,Marcia,82.4
1,Kadeem,78.2
2,Nash,79.3
3,Noelani,83.2
4,Noelani,87.4


In [7]:
# from the same file in the previous cell
# this time we will add on column header names as the data is loaded

df_during = pd.read_csv(location2, names=['Name', 'Grade'])

df_during.head()

Unnamed: 0,Name,Grade
0,Marcia,82.4
1,Kadeem,78.2
2,Nash,79.3
3,Noelani,83.2
4,Noelani,87.4


In [8]:
#verify that this dataframe has no column header names
df_nohead.head()

Unnamed: 0,0,1
0,Marcia,82.4
1,Kadeem,78.2
2,Nash,79.3
3,Noelani,83.2
4,Noelani,87.4


In [9]:
# add header names to "df_nohead" dataframe

df_nohead.columns = ['Name', 'Grade']
df_nohead.head()

Unnamed: 0,Name,Grade
0,Marcia,82.4
1,Kadeem,78.2
2,Nash,79.3
3,Noelani,83.2
4,Noelani,87.4


### Excel files

In `pandas`, the function to read in data from an Excel file is `.read_excel()`, and it works similarly to the `.read_csv()` function.

In [10]:
#import data from Excel file

location = "datasets/gradedata.xlsx"
df = pd.read_excel(location) #overwrites the info from the df variable in the examples above
df.head()

Unnamed: 0,fname,lname,gender,age,exercise,hours,grade,address
0,Marcia,Pugh,female,17,3,10,82.4,"7379 Highland Rd. , Dublin, GA 31021"
1,Kadeem,Morrison,male,18,4,4,78.2,"8 Bayport St. , Honolulu, HI 96815"
2,Nash,Powell,male,18,5,9,79.3,"Encino, CA 91316, 3 Lilac Street"
3,Noelani,Wagner,female,14,2,7,83.2,"Riverview, FL 33569, 9998 North Smith Dr."
4,Noelani,Cherry,female,18,4,15,87.4,"97 SE. Ocean Street , Bethlehem, PA 18015"


## Make a file from dataframe

After the data has been manipulated and transformed into a satisfactory format, it is often needed in another program (Excel, Tableau, etc.). Because the data cannot be sent directly from Jupyter notebook to another application, we can create a file from the dataframe. CSV is the easiest format to use, but other file formats can be used as well (Excel, JSON).

#### Zip function
- Is a Python built-in function that combines corresponding elements from multiple sequences into tuples. In Python 3 it returns an iterator, so it is wrapped in `list()` to get a list of tuples.
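
A minimal sketch of how `zip` behaves, using made-up lists: the result is a lazy iterator until it is passed to `list()`, and pairing stops at the shortest input.

```python
names = ['Bob', 'Jessica', 'Mary']
grades = [76, 95, 77]

pairs = zip(names, grades)    # a lazy iterator, not yet a list
pair_list = list(pairs)       # build the (name, grade) tuples

print(pair_list)  # [('Bob', 76), ('Jessica', 95), ('Mary', 77)]

# zip stops at the shortest sequence, so 'Mary' has no partner here
short_list = list(zip(names, [1, 2]))
print(short_list)  # [('Bob', 1), ('Jessica', 2)]
```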

In [11]:
# create fake data

names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
GradeList = list(zip(names,grades))

df = pd.DataFrame(data = GradeList, columns=['Names','Grades'])

### Export data to CSV file

In [12]:
# export data to a file called "studentgrades.csv"
# file will be created in the same folder as the notebook, unless specified

df.to_csv('studentgrades.csv',index=False,header=False)
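
To see what `index=False` and `header=False` do to the exported text, the sketch below writes a small made-up dataframe to an in-memory buffer (instead of a file) and reads it back. The names `df_small`, `buffer`, and `df_back` are only for this illustration.

```python
import io
import pandas as pd

df_small = pd.DataFrame({'Names': ['Bob', 'Jessica'], 'Grades': [76, 95]})

# write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df_small.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()
print(csv_text)

# header=False wrote no column names, so they must be supplied again when reading
df_back = pd.read_csv(io.StringIO(csv_text), names=['Names', 'Grades'])
print(df_back)
```

Because neither the row labels nor the header were written, the exported text contains only the data values.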

### Export data to Excel file

In [13]:
# create Excel file from dataframe
# "engine" parameter selects the library used to write the file (here, xlsxwriter for .xlsx)
# the "with" block saves and closes the file automatically
# (writer.save() was removed in pandas 2.0)

with pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1')

In [14]:
# export different dataframes to separate sheets in same Excel file
# the "with" block saves and closes the file automatically

with pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
    df_nohead.to_excel(writer, sheet_name='Sheet2')

## Data Features

The first steps in the EDA process should be to verify that the data is properly loaded (using `.head()` or `.tail()`), check the number of rows and columns, find out what columns exist in the dataset, and determine the data type of each column.

### Exploratory Analysis
It is used to gain a better understanding of aspects of the data such as:
- main features of data
- variables and relationships that hold between them
- identifying which variables are important for our problem

In [15]:
#load gradedata.csv file

location = "datasets/gradedata.csv"
df = pd.read_csv(location)

df.head()

Unnamed: 0,fname,lname,gender,age,exercise,hours,grade,address
0,Marcia,Pugh,female,17,3,10,82.4,"9253 Richardson Road, Matawan, NJ 07747"
1,Kadeem,Morrison,male,18,4,4,78.2,"33 Spring Dr., Taunton, MA 02780"
2,Nash,Powell,male,18,5,9,79.3,"41 Hill Avenue, Mentor, OH 44060"
3,Noelani,Wagner,female,14,2,7,83.2,"8839 Marshall St., Miami, FL 33125"
4,Noelani,Cherry,female,18,4,15,87.4,"8304 Charles Rd., Lewis Center, OH 43035"


In [16]:
# total number of rows in dataframe
len(df)

2000

In [17]:
# show the number of rows and columns in a dataframe
# format is (# rows, # columns)
df.shape

(2000, 8)

In [18]:
#show the column names in the dataset
#or print(df.columns)
df.columns

Index(['fname', 'lname', 'gender', 'age', 'exercise', 'hours', 'grade',
       'address'],
      dtype='object')

In [19]:
#show the data type of each column
df.dtypes

fname        object
lname        object
gender       object
age           int64
exercise      int64
hours         int64
grade       float64
address      object
dtype: object

In [20]:
#number of non-null (non-missing) values in each column
df.count()

fname       2000
lname       2000
gender      2000
age         2000
exercise    2000
hours       2000
grade       2000
address     2000
dtype: int64

## Descriptive Statistics

Descriptive statistics, also called summary statistics, are a way to get information about the characteristics of a column. Common features include the maximum and minimum values, the total (sum) of all the values, and the mean or median (measures of central tendency).

To access a column in the dataframe use the dataframe variable name followed by a pair of square brackets with the name of the column (as a string) inside of it (similar to getting the value from a dictionary using a key).

In [21]:
# df_name['column name'] to access a column

# maximum age (oldest student)
df['age'].max()

19

In [22]:
# minimum age (youngest student)
df['age'].min()

14

In [23]:
# total sum of ages
df['age'].sum()

33157

### Mean
The mean, or average, is a measure of the central tendency of the data. It is calculated by adding all the data points and then dividing the total by the number of points.


In [24]:
# mean age (arithmetic average)
df['age'].mean()

16.5785

### Median
 - The "middle" of a sorted list of numbers. 
 - If the number of terms is odd, then the median is the value of the term in the middle.
 - If the number of terms is even, then the median is the average of the two terms in the middle
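
The odd and even cases above can be checked with two small made-up series (a sketch; the values are only for illustration).

```python
import pandas as pd

odd_count = pd.Series([7, 1, 5])       # sorted: 1, 5, 7
even_count = pd.Series([7, 1, 5, 3])   # sorted: 1, 3, 5, 7

median_odd = odd_count.median()
median_even = even_count.median()

print(median_odd)    # 5.0 (the middle term)
print(median_even)   # 4.0 (average of 3 and 5)
```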

In [25]:
# middle average age (median)
df['age'].median()

17.0

### Mode
- Is the value of the term that occurs the most often.
- Sometimes there is more than one mode. This happens when two or more terms occur with equal frequency, and more often than any of the others
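
A short sketch of the multi-mode case, using made-up ages: `.mode()` returns a Series, so it can hold more than one value.

```python
import pandas as pd

ages = pd.Series([14, 15, 15, 16, 16, 17])

# 15 and 16 both occur twice, more often than any other value
modes = ages.mode()

print(modes.tolist())  # [15, 16]
```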

In [26]:
# mode age (most frequently occurring) of all students
# can have two or more most frequent values
df['age'].mode()

0    17
dtype: int64

### Standard deviation
- Is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
- In statistics, the 68–95–99.7 rule, also known as the empirical rule, is a shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution: about 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively.

Calculate Standard Deviation

<img src=" https://www.thoughtco.com/thmb/GVy7U2pn2FP2Tut4MG53ZiCgOCQ=/768x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/calculate-a-sample-standard-deviation-3126345-v4-CS-01-5b76f58f46e0fb0050bb4ab2.png">


1. The mean is calculated by adding all the data points and dividing by the number of data points.
2. The variance is calculated by first subtracting the mean from each data point. Each of those differences is then squared and the results are summed. The sum is then divided by the number of data points less one.
3. The square root of the variance (the result from step 2) is then taken to find the standard deviation.
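
The steps above can be sketched by hand and compared with the `.std()` function. The values are made up for illustration; the comparison relies on `pandas` dividing by the number of points less one (`ddof=1`) by default.

```python
import math
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# step 1: mean
mean = values.sum() / len(values)

# step 2: sample variance (sum of squared differences divided by n - 1)
squared_diffs = (values - mean) ** 2
variance = squared_diffs.sum() / (len(values) - 1)

# step 3: standard deviation is the square root of the variance
std_manual = math.sqrt(variance)

print(std_manual)
print(values.std())   # should match the manual result
```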




In [27]:
# standard deviation (distance away from mean)
df['age'].std()

1.6962536332147102

#### Describe function
`df.describe()`

- NaN values are ignored when calculating these statistics.
- By default, it excludes character (text) columns and calculates summary statistics only for numeric columns.
- The function gives the count, mean, std, minimum, maximum, and the 25th, 50th, and 75th percentiles.
- The interquartile range (IQR) is the difference between the 75th and 25th percentiles of the data, so it can be computed from the `75%` and `25%` rows. It is a measure of dispersion similar to standard deviation or variance, but is much more robust against outliers.
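
As a sketch of why the IQR is robust against outliers, the example below computes it for two made-up series that differ only in one extreme value. The helper `iqr` is a hypothetical function written for this illustration, not part of `pandas`.

```python
import pandas as pd

grades = pd.Series([70, 72, 75, 78, 80, 82, 85, 88])
with_outlier = pd.Series([70, 72, 75, 78, 80, 82, 85, 500])

def iqr(series):
    # difference between the 75th and 25th percentiles
    return series.quantile(0.75) - series.quantile(0.25)

# the same value can be read from the describe() output
summary = grades.describe()
iqr_from_describe = summary['75%'] - summary['25%']

print(iqr(grades), grades.std())
print(iqr(with_outlier), with_outlier.std())
print(iqr_from_describe)
```

The single extreme value inflates the standard deviation considerably, while the IQR stays the same because the 25th and 75th percentiles do not move.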

In [28]:
#descriptive statistics
df.describe()

Unnamed: 0,age,exercise,hours,grade
count,2000.0,2000.0,2000.0,2000.0
mean,16.5785,3.0005,10.9885,82.55605
std,1.696254,1.423205,4.063942,9.747593
min,14.0,0.0,0.0,32.0
25%,15.0,2.0,8.0,75.575
50%,17.0,3.0,11.0,82.7
75%,18.0,4.0,14.0,89.7
max,19.0,5.0,20.0,100.0


In [29]:
#unique ages (distinct values)
df['age'].unique()

array([17, 18, 14, 16, 15, 19], dtype=int64)

In [30]:
# how many students are in each age group
# first value is the mode
df['age'].value_counts()

17    356
18    349
19    345
16    334
14    316
15    300
Name: age, dtype: int64

### Summary statistics by group categories

While descriptive statistics can show a general overview of a dataset, calculating statistics for different categories within a feature (column) can identify patterns and trends specific to each group. The `.groupby()` and `.pivot_table()` functions both summarize statistical information for each group.

#### groupby() function 
- Is used to split the data into groups based on some criteria
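
A minimal sketch of the split-then-summarize idea on a made-up dataframe (`df_demo` is only for this illustration). The `.agg()` function is used here to request several statistics at once.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'male'],
    'hours': [10, 4, 7, 9],
})

# split rows by gender, then compute several statistics of "hours" per group
summary = df_demo.groupby('gender')['hours'].agg(['mean', 'min', 'max'])

print(summary)
```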

In [31]:
# doesn't do anything on its own
# groupby function must be combined with a statistical calculation
df.groupby('gender')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002A481A112C8>

In [32]:
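# creates table of averages per gender category for all numerical-type columns
# numeric_only=True skips text columns (pandas 2.0+ no longer drops them automatically)
df.groupby(df['gender']).mean(numeric_only=True)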

Unnamed: 0_level_0,age,exercise,hours,grade
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,16.568,3.047,10.932,82.7173
male,16.589,2.954,11.045,82.3948


In [33]:
# average hours per gender group
df.groupby(df['gender'])['hours'].mean()

gender
female    10.932
male      11.045
Name: hours, dtype: float64

In [34]:
# average hours and exercise per gender group
# selecting several columns requires a list inside the brackets (double brackets)
df.groupby(df['gender'])[['hours', 'exercise']].mean()

Unnamed: 0_level_0,hours,exercise
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
female,10.932,3.047
male,11.045,2.954


In [35]:
# first group by gender then group by age category
# average hours of study by gender and age
df.groupby(['gender', 'age'])['hours'].mean()

gender  age
female  14     10.658228
        15     11.050633
        16     11.158228
        17     10.943503
        18     10.950549
        19     10.832335
male    14     11.170886
        15     10.866197
        16     11.465909
        17     11.268156
        18     10.856287
        19     10.612360
Name: hours, dtype: float64

### Pivot table
- This tool enables users to automatically sort, count, total, or average the data stored in one table.
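
As a sketch of how a pivot table rearranges data, the example below uses a made-up dataframe (`df_demo`) and the `columns` argument, which spreads a second category across the columns of the result.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'gender': ['female', 'female', 'male', 'male'],
    'age': [14, 15, 14, 15],
    'hours': [10, 7, 4, 9],
})

# rows are genders, columns are ages, cells hold the mean of "hours"
table = pd.pivot_table(df_demo, values='hours', index='gender',
                       columns='age', aggfunc='mean')

print(table)
```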

In [36]:
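# pivot table default statistical function is mean
# "values" lists the numeric columns to summarize (text columns cannot be averaged)
pd.pivot_table(df, values=['age', 'exercise', 'grade', 'hours'], index=['gender'])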

Unnamed: 0_level_0,age,exercise,grade,hours
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,16.568,3.047,82.7173,10.932
male,16.589,2.954,82.3948,11.045


In [37]:
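# change statistical function to median
# "values" lists the numeric columns to summarize (text columns have no median)
pd.pivot_table(df, values=['age', 'exercise', 'grade', 'hours'], index=['gender'], aggfunc='median')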

Unnamed: 0_level_0,age,exercise,grade,hours
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,17,3,82.6,11
male,17,3,83.0,11


In [38]:
# calculate mean for only the "age" column
pd.pivot_table(df, values=['age'], index=['gender'])

Unnamed: 0_level_0,age
gender,Unnamed: 1_level_1
female,16.568
male,16.589


## Missing data

Data is messy, and one of the most common problems is missing information. In the data cleaning phase, we will learn how to handle rows and columns with missing values, but first we need to identify where there is missing information. Missing values in Python are typically called ***null values***, and in a dataframe, missing cells have `NaN` (for "not a number") as a visual placeholder. To find the total number of missing values in a column, the `.isnull()` function creates a column of True/False values (also called "boolean" values) depending on whether the value in that cell is missing (null = True, not null = False). Then use the `.sum()` function on the resulting True/False values to count the number of `True` values, since each `True` counts as 1.
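
The counting described above can be sketched on a small made-up dataframe (`df_demo` is only for this illustration). The last step, `.any(axis=1)`, is an additional technique that flags rows with at least one missing value.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cal'],
    'age': [17, np.nan, 15],
    'grade': [82.4, np.nan, np.nan],
})

# each True counts as 1, so the sum is the number of missing cells per column
missing_per_column = df_demo.isnull().sum()

# keep only the rows that have at least one missing value
rows_with_missing = df_demo[df_demo.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```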

In [4]:
# load data from file and show first 5 rows of dataframe

filename = "datasets/gradedatamissing.csv"

df_missing = pd.read_csv(filename)

df_missing.head()

Unnamed: 0,fname,lname,gender,age,exercise,hours,grade,address
0,Marcia,Pugh,female,17.0,3.0,10.0,82.4,"9253 Richardson Road, Matawan, NJ 07747"
1,Kadeem,Morrison,male,18.0,4.0,4.0,78.2,"33 Spring Dr., Taunton, MA 02780"
2,Nash,Powell,male,18.0,5.0,9.0,79.3,"41 Hill Avenue, Mentor, OH 44060"
3,Noelani,Wagner,female,14.0,2.0,7.0,83.2,"8839 Marshall St., Miami, FL 33125"
4,Noelani,Cherry,female,18.0,4.0,15.0,87.4,"8304 Charles Rd., Lewis Center, OH 43035"


In [40]:
# total number of non-missing values in each column
df_missing.count()

fname       2000
lname       2000
gender      2000
age         2000
exercise    2000
hours       2000
grade       2000
address     2000
dtype: int64

In [41]:
# total missing values per column
df_missing.isnull().sum()

fname       0
lname       0
gender      2
age         7
exercise    8
hours       8
grade       2
address     0
dtype: int64

In [5]:

#show rows with missing values
#missing will only show True/False values
missing = df_missing['exercise'].isnull()


# .loc[] selects rows by index label or, as here, by a True/False (boolean) column
df_missing.loc[missing]

Unnamed: 0,fname,lname,gender,age,exercise,hours,grade,address
592,Mark,Hankerson,male,16.0,,17.0,92.1,"3351 Hibiscus St. Miami, Fl 33133"
673,Jeremy,Rowe,male,19.0,,10.0,83.4,"6 Madison St., Fall River, MA 02720"
683,Kalia,Moss,female,,,,,"760 East Stillwater St., Rosedale, NY 11422"
699,Gretchen,Haynes,female,,,,77.2,"500 Border St., New Lenox, IL 60451"
700,Ruth,Bowman,female,,,,100.0,"8621 Shub Farm Ave., Ocean Springs, MS 39564"
701,Eric,Walter,male,,,,93.2,"416 Glendale Ave., Green Cove Springs, FL 32043"
884,Dara,Giles,female,15.0,,17.0,88.4,"164 Shore Street, Wadsworth, OH 44281"
981,Fitzgerald,Barry,male,15.0,,10.0,69.6,"836 N. Miles St., Oak Park, MI 48237"


## Duplication

When looking for duplicates in `pandas`, the `.duplicated()` function by default searches for duplication across an entire row (meaning, all the values in one row are exactly the same as all the values in another row). It produces a column of boolean values (`True`/`False`) indicating whether each row is a duplicate. The first time the `.duplicated()` function sees a row of information, that row is not considered a duplicate (because the function has not yet seen any other rows with the same data). However, whenever the function comes across the same information again, that row (and any following rows that contain the same data) is marked as a duplicate. To look for duplicates only in specific columns, use the argument `.duplicated(subset=[list of column names])`.
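
Which copy counts as the "original" can be changed with the `keep` argument of `.duplicated()`. The sketch below uses a small made-up dataframe to compare the default with the other two settings.

```python
import pandas as pd

df_demo = pd.DataFrame({'Names': ['Jessica', 'John', 'Jessica'],
                        'Grades': [95, 78, 95]})

first_kept = df_demo.duplicated()             # default keep='first'
last_kept = df_demo.duplicated(keep='last')   # mark all but the last occurrence
all_marked = df_demo.duplicated(keep=False)   # mark every copy

print(first_kept.tolist())   # [False, False, True]
print(last_kept.tolist())    # [True, False, False]
print(all_marked.tolist())   # [True, False, True]
```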

In [42]:
# create fake data with duplicates

names = ['Jessica','John','Bob','Jessica','Mary','John','Mel','Mel']
grades = [95,78,76,95,77,78,99,100]
dupe_data = list(zip(names,grades))

df = pd.DataFrame(data = dupe_data, columns=['Names', 'Grades'])
df

Unnamed: 0,Names,Grades
0,Jessica,95
1,John,78
2,Bob,76
3,Jessica,95
4,Mary,77
5,John,78
6,Mel,99
7,Mel,100


In [43]:
# find the duplicate rows (same information in both columns for 2+ rows)
dupe = df.duplicated()
dupe

0    False
1    False
2    False
3     True
4    False
5     True
6    False
7    False
dtype: bool

In [44]:
# choose the rows where there are True values
# duplicate of Jessica - 95; John - 78
df.loc[dupe]

Unnamed: 0,Names,Grades
3,Jessica,95
5,John,78


In [45]:
# check for duplicates only in the "Names" column (will ignore "Grades")
# Jessica, John, and Mel are name duplicates

dupe_name = df.duplicated(subset=['Names'])
df.loc[dupe_name]

Unnamed: 0,Names,Grades
3,Jessica,95
5,John,78
7,Mel,100


## Subset a dataframe

Dataframe rows and columns can be referred to in two different ways - either by their label or by their index position. Column labels are the header names and are typically purposely named in the dataset. User-defined row labels are less common; if there are not any specific labels given to the rows, then the index position also becomes the label (same goes for column labels). Because Python positions start from zero, the first column or row is always index position zero, no matter what labels are assigned to the row.

Selecting rows or columns from a dataframe returns a smaller dataframe called a subset, which can then be assigned to its own variable name. If the original dataframe had default labels (using position) assigned to the rows (and columns, if applicable), then the subset labels will be inherited from the original dataframe, even though the index positions are different. For example, if a row had the label `4` and was also in index position `4`, and was then assigned to a subset where it is now the first row, the label would still be `4` but the position would be `0`.
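
The label-versus-position example above can be sketched with a small made-up dataframe (`df_demo` and `subset` are only for this illustration).

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [10, 20, 30, 40, 50]})   # default labels 0 to 4

# keep only the last row; it becomes the first row of the subset
subset = df_demo[df_demo['A'] >= 50]

print(subset.index.tolist())   # [4]  label inherited from the original
print(subset.iloc[0]['A'])     # 50   position 0 in the subset
print(subset.loc[4]['A'])      # 50   same row, selected by label
```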

There are two functions that are used to select rows or columns in a dataframe:

- `dataframe_name.loc[row label]`: select rows by label name
- `dataframe_name.iloc[row position]`: select rows by index position number

### Choosing rows

In [46]:
# create dataframe "A"
# dataframe "A" has default index positions as labels

colA = [10, 20, 40, 50]
colB = ['no', 'yes', 'yes', 'no']

A_B = list(zip(colA, colB))

df_A = pd.DataFrame(data=A_B, columns=['A', 'B'])
df_A

Unnamed: 0,A,B
0,10,no
1,20,yes
2,40,yes
3,50,no


In [47]:
# select the row from dataframe "A" with label name 0
df_A.loc[0]

A    10
B    no
Name: 0, dtype: object

In [48]:
# select the row from dataframe "A" with index position 0
df_A.iloc[0]

A    10
B    no
Name: 0, dtype: object

In [49]:
# create dataframe "B"
# dataframe "B" has user-defined labels
index = ['a', 'b', 'c', 'd']
colA = [10, 20, 40, 50]
colB = ['no', 'yes', 'yes', 'no']

df_B = pd.DataFrame(data=A_B, columns=['A', 'B'], index=index)

df_B

Unnamed: 0,A,B
a,10,no
b,20,yes
c,40,yes
d,50,no


In [50]:
# select the row in dataframe "B" with the row label name "b"
df_B.loc['b']

A     20
B    yes
Name: b, dtype: object

In [51]:
# select the row in dataframe "B" with index position 1
# same row as previous cell; different way of selecting it
df_B.iloc[1]

A     20
B    yes
Name: b, dtype: object

#### Let's see how this can get tricky

In [52]:
# dataframe "C" is a copy of dataframe "A"
df_C = df_A.copy()
df_C

Unnamed: 0,A,B
0,10,no
1,20,yes
2,40,yes
3,50,no


In [53]:
# dataframe "D" has index label names that are numbers (but not the default index positions)
index = [2,1,0,3]
colA = [10, 20, 40, 50]
colB = ['no', 'yes', 'yes', 'no']

df_D = pd.DataFrame(data=A_B, columns=['A', 'B'], index=index)
df_D

Unnamed: 0,A,B
2,10,no
1,20,yes
0,40,yes
3,50,no


Rows can also be selected as a group (not just a single row) by including the starting row and ending row with a colon between them inside the square brackets (`dataframe_name[start:end]`). When using the `.iloc` function to choose rows, it works similarly to Python list slicing - it chooses the starting index row but the ending index row position is not inclusive. However, for choosing a group of rows by the label name (`.loc`), the ending row label value is included in the subset.

In [54]:
# choose index position 0 through the index position before 3
df_D.iloc[0:3]

Unnamed: 0,A,B
2,10,no
1,20,yes
0,40,yes


In [55]:
# select label '0' and all the rows that are in between and ending with the row with label '3'
df_D.loc[0:3]

Unnamed: 0,A,B
0,40,yes
3,50,no


In [56]:
# select rows starting from index position 0 through BUT NOT INCLUDING position 3
df_C.iloc[0:3]

Unnamed: 0,A,B
0,10,no
1,20,yes
2,40,yes


In [57]:
# select with index label '0' and everything in between and include the label '3'
df_C.loc[0:3]

Unnamed: 0,A,B
0,10,no
1,20,yes
2,40,yes
3,50,no


### Choosing columns

Just as selecting rows reduces the size of the dataframe into a subset where we can study the particular characteristics of that group, we may also want to keep certain columns from the dataset to make it easier to focus on the relevant information for analysis. To choose columns to keep for the subset dataframe, use the syntax `dataframe_name[[list of columns]]` (a list of column names inside another pair of square brackets to select that list as a dataframe).
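
The difference between single and double square brackets can be sketched on a small made-up dataframe (`df_demo` is only for this illustration): single brackets return one column as a Series, while a list of names returns a dataframe.

```python
import pandas as pd

df_demo = pd.DataFrame({'fname': ['Marcia', 'Kadeem'],
                        'hours': [10, 4],
                        'grade': [82.4, 78.2]})

one_column = df_demo['hours']               # single brackets: a Series
as_frame = df_demo[['hours']]               # double brackets: a one-column dataframe
two_columns = df_demo[['fname', 'grade']]   # several columns, in the order listed

print(type(one_column).__name__)    # Series
print(type(as_frame).__name__)      # DataFrame
print(two_columns.columns.tolist()) # ['fname', 'grade']
```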

In [58]:
# use the gradedata.csv file

location = "datasets/gradedata.csv"
df = pd.read_csv(location)

df.head()

Unnamed: 0,fname,lname,gender,age,exercise,hours,grade,address
0,Marcia,Pugh,female,17,3,10,82.4,"9253 Richardson Road, Matawan, NJ 07747"
1,Kadeem,Morrison,male,18,4,4,78.2,"33 Spring Dr., Taunton, MA 02780"
2,Nash,Powell,male,18,5,9,79.3,"41 Hill Avenue, Mentor, OH 44060"
3,Noelani,Wagner,female,14,2,7,83.2,"8839 Marshall St., Miami, FL 33125"
4,Noelani,Cherry,female,18,4,15,87.4,"8304 Charles Rd., Lewis Center, OH 43035"


In [59]:
# select the columns from the list of names "fname", "lname", "hours", and "grade"
small_df = df[['fname', 'lname', 'hours', 'grade']]

In [60]:
# show first 5 rows of smaller dataframe with those columns only
small_df.head()

Unnamed: 0,fname,lname,hours,grade
0,Marcia,Pugh,10,82.4
1,Kadeem,Morrison,4,78.2
2,Nash,Powell,9,79.3
3,Noelani,Wagner,7,83.2
4,Noelani,Cherry,15,87.4


# Bonus Code!

Although not a requirement for the material that will be covered in this course, below is a code example of a task you may encounter while working on a project.

### Load multiple data files

If you are faced with multiple files with information you want to aggregate into a single dataframe, instead of loading each file individually as their own separate dataframes, the `glob` library will gather a list of file names from a directory and then you can use `pandas` to create the entire dataframe automatically.

In [61]:
import glob #used to read file names in directory

# make an empty list called "frames"
# each file's dataframe will be added to this list
frames = []

# iterate through each Excel file in the datasets directory with a name starting with "data"
for file in glob.glob("datasets/data*.xlsx"):
    
    #make a dataframe with that file's data
    df = pd.read_excel(file)
    
    # add the small dataframe to the list
    frames.append(df)

# combine all the dataframes into one overall dataframe ("all_data")
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
all_data = pd.concat(frames, ignore_index=True)

# view the number of rows in dataframe
# each data file had 100 rows of information (3 total files)
len(all_data) 

300

In [62]:
all_data.head()

Unnamed: 0,fname,age,grade
0,Baker,14,90
1,Josephine,19,100
2,Calvin,15,66
3,Aretha,17,84
4,Britanney,19,66
