Skip to content

learn-co-students/dsc-pandas-series-and-dataframes-lab-chi01-dtsc-ft-051120

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Understanding Pandas Series and DataFrames - Lab

Introduction

In this lab, let's get some hands-on practice working with data cleanup using Pandas.

Objectives

You will be able to:

  • Use the .map() and .apply() methods to apply a function to a pandas Series or DataFrame
  • Perform operations to change the structure of pandas DataFrames
  • Change the index of a pandas DataFrame
  • Change data types of columns in pandas DataFrames

Let's get started!

Import the file 'turnstile_180901.txt'.

# Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Import the file 'turnstile_180901.txt'
df = pd.read_csv('turnstile_180901.txt')

# Print the number of rows ans columns in df
print(df.shape)

# Print the first five rows of df
df.head()

Rename all the columns to lower case:

# Rename all the columns to lower case

Change the index to 'linename':

# Change the index to 'linename'

Reset the index:

# Reset the index

Create another column 'Num_Lines' that is a count of how many lines pass through a station. Then sort your DataFrame by this column in descending order.

Hint: According to the data dictionary, LINENAME represents all train lines that can be boarded at a given station. Normally lines are represented by one character. For example, LINENAME 456NQR represents trains 4, 5, 6, N, Q, and R.

# Add a new 'num_lines' column

Write a function to clean column names:

def clean(col_name):
    # Clean the column name in any way you want to. Hint: think back to str methods 
    cleaned = None
    return cleaned
# Use the above function to clean the column names
# Check to ensure the column names were cleaned
df.columns
  • Change the data type of the 'date' column to a date
  • Add a new column 'day_of_week' that represents the day of the week
# Convert the data type of the 'date' column to a date


# Add a new column 'day_of_week' that represents the day of the week 
# Group the data by day of week and plot the sum of the numeric columns
grouped = df.groupby('day_of_week').sum()
grouped.plot(kind='barh')
plt.show()
  • Remove the index of grouped
  • Print the first five rows of grouped
# Reset the index of grouped
grouped = None

# Print the first five rows of grouped

Add a new column 'is_weekend' that maps the 'day_of_week' column using the dictionary weekend_map

# Use this dictionary to create a new column 
weekend_map = {0:False, 1:False, 2:False, 3:False, 4:False, 5:True, 6:True}

# Add a new column 'is_weekend' that maps the 'day_of_week' column using weekend_map
grouped['is_weekend'] = grouped['day_of_week'].map(weekend_map)
# Group the data by weekend/weekday and plot the sum of the numeric columns
wkend = grouped.groupby('is_weekend').sum()
wkend[['entries', 'exits']].plot(kind='barh')
plt.show()

Remove the 'c/a' and 'scp' columns.

# Remove the 'c/a' and 'scp' columns
df = None
df.head(2)

Analysis Question

What is misleading about the day of week and weekend/weekday charts you just plotted?

# Your answer here 

Summary

Great! You practiced your data cleanup skills using Pandas.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published