In this section, we explore ways to bring data into Jupyter notebooks and how to export it back out. After performing our data analysis, we will see how to write a DataFrame to a CSV file, an Excel file, or another format.

Topics covered in this section:
- Pass a URL to the pd.read_csv method.
- Quick object conversions.
- Export CSV file with the to_csv method. 
- Install xlrd and openpyxl libraries to read and write Excel files.
- Import Excel file into pandas with the read_excel method.
- Export Excel file with the to_excel method.

In [1]:
import pandas as pd
pd.__version__

'1.1.3'

Pass a URL to the pd.read_csv Method

We will pass a URL to the pd.read_csv method so that pandas downloads a dataset from the Internet directly into the local Jupyter notebook. This is most useful when the dataset lives on a website that is continually updated. Instead of manually downloading the latest version each time, we provide a link to the CSV's location, and every time we run the notebook we get the latest batch of data.

In [5]:
url = "https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv"
baby_names = pd.read_csv(url)
baby_names.head(6)

# Whenever the CSV behind that link is updated, re-running this cell fetches the latest version.

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,FEMALE,HISPANIC,GERALDINE,13,75
1,2011,FEMALE,HISPANIC,GIA,21,67
2,2011,FEMALE,HISPANIC,GIANNA,49,42
3,2011,FEMALE,HISPANIC,GISELLE,38,51
4,2011,FEMALE,HISPANIC,GRACE,36,53
5,2011,FEMALE,HISPANIC,GUADALUPE,26,62
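
pd.read_csv is not limited to URLs: the same call also accepts a local file path or a file-like object. The sketch below uses an in-memory buffer as a stand-in for the remote file so that it runs offline. The three sample rows are copied from the output above, and the variable names are illustrative only.

```python
import io

import pandas as pd

# Stand-in for the remote CSV: same column names as the NYC dataset, three rows only.
csv_text = (
    "Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank\n"
    "2011,FEMALE,HISPANIC,GERALDINE,13,75\n"
    "2011,FEMALE,HISPANIC,GIA,21,67\n"
    "2011,FEMALE,HISPANIC,GIANNA,49,42\n"
)

# read_csv handles a URL string, a file path and a file-like object with the same call.
sample = pd.read_csv(io.StringIO(csv_text))
first_names = sample["Child's First Name"].tolist()

print(sample.shape)
print(first_names)
```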


Quick Object Conversions

We will take a look at how we can convert a pandas object such as a Series or a DataFrame to a vanilla Python object such as a list, dictionary or string.

In [6]:
url = "https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv"
baby_names = pd.read_csv(url)
baby_names.head(6)

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,FEMALE,HISPANIC,GERALDINE,13,75
1,2011,FEMALE,HISPANIC,GIA,21,67
2,2011,FEMALE,HISPANIC,GIANNA,49,42
3,2011,FEMALE,HISPANIC,GISELLE,38,51
4,2011,FEMALE,HISPANIC,GRACE,36,53
5,2011,FEMALE,HISPANIC,GUADALUPE,26,62


In [7]:
baby_names["Child's First Name"].to_frame()

# Grab the "Child's First Name" column as a Series and convert it to a one-column DataFrame using .to_frame(). 

# Note: Converting a Series to a DataFrame is useful because it is easier to merge two DataFrames together than it is to merge a 
# DataFrame with a Series. 

0        GERALDINE
1              GIA
2           GIANNA
3          GISELLE
4            GRACE
           ...    
29459       Alayna
29460      Yaritza
29461       Mendel
29462        Isaac
29463      Alessia
Name: Child's First Name, Length: 29464, dtype: object

In [9]:
# Convert the Series to a list. 
baby_names["Child's First Name"].tolist()

# Convert the Series to a dictionary.
baby_names["Child's First Name"].to_dict()
# Returns a dictionary that uses the index of the Series as the keys and the values of the Series as the values. 
# Note: Python dictionaries cannot hold duplicate keys. If the Series index contains duplicate labels, only one value per label
# survives (the last one). So, make sure the Series index is unique before converting to a dictionary. 

['GERALDINE',
 'GIA',
 'GIANNA',
 'GISELLE',
 'GRACE',
 'GUADALUPE',
 'HAILEY',
 'HALEY',
 'HANNAH',
 'HAYLEE',
 'HAYLEY',
 'HAZEL',
 'HEAVEN',
 'HEIDI',
 'HEIDY',
 'HELEN',
 'IMANI',
 'INGRID',
 'IRENE',
 'IRIS',
 'ISABEL',
 'ISABELA',
 'ISABELLA',
 'ISABELLE',
 'ISIS',
 'ITZEL',
 'IZABELLA',
 'JACQUELINE',
 'JADA',
 'JADE',
 'JAELYNN',
 'JAMIE',
 'JANELLE',
 'JASLENE',
 'JASMIN',
 'JASMINE',
 'JAYDA',
 'JAYLA',
 'JAYLAH',
 'JAYLEEN',
 'JAYLENE',
 'JAYLIN',
 'JAYLYN',
 'JAZLYN',
 'JAZMIN',
 'JAZMINE',
 'JENNIFER',
 'JESSICA',
 'JIMENA',
 'JOCELYN',
 'JOHANNA',
 'JOSELYN',
 'JULIA',
 'JULIANA',
 'JULIANNA',
 'JULIET',
 'JULIETTE',
 'JULISSA',
 'KAELYN',
 'KAILEY',
 'KAILYN',
 'KAITLYN',
 'KAMILA',
 'KAREN',
 'KARLA',
 'KATE',
 'KATELYN',
 'KATELYNN',
 'KATHERINE',
 'KATIE',
 'KAYLA',
 'KAYLEE',
 'KAYLEEN',
 'KAYLEIGH',
 'KAYLIE',
 'KAYLIN',
 'KEILY',
 'KELLY',
 'KEYLA',
 'KHLOE',
 'KIARA',
 'KIMBERLY',
 'KRYSTAL',
 'KYLEE',
 'KYLIE',
 'LAILA',
 'LAURA',
 'LAUREN',
 'LAYLA',
 'LEA',
 'L
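
To make the three conversions concrete, here is a minimal sketch on a small hand-made Series. The values and the deliberately duplicated index label are made up for illustration. It shows why a unique index matters before calling to_dict().

```python
import pandas as pd

# Hypothetical Series; index label 1 is duplicated on purpose.
names = pd.Series(["GIA", "GRACE", "HAZEL"], index=[0, 1, 1], name="Child's First Name")

as_frame = names.to_frame()  # one-column DataFrame named after the Series
as_list = names.tolist()     # plain Python list, index discarded
as_dict = names.to_dict()    # index labels become dictionary keys

print(as_frame.columns.tolist())
print(as_list)
print(names.index.is_unique)  # worth checking before relying on to_dict()
print(as_dict)                # the later value for label 1 replaces the earlier one
```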

In [30]:
# Convert a Series to a string, with the names sorted in alphabetical order, duplicates removed and
# each name in title case (only the first letter capitalized). 

" ".join(["Leon", "is", "a data scientist"])
# We can use the .join() string method to join a list of strings together. 

", ".join(baby_names["Child's First Name"].str.title().drop_duplicates().sort_values())

# We achieved the operation in one line by using the following methods:
# 1. str.title(): capitalizes the first letter of every word.
# 2. drop_duplicates(): removes duplicate values from the Series. Run it after normalizing the capitalization so that
#    differently capitalized spellings of the same name count as duplicates.
# 3. sort_values(): sorts string values alphabetically. Use after dropping duplicates so there are fewer values to sort. 
# 4. ", ".join(): joins all the strings together, separated by ", ". 

"Aaliyah, Aarav, Aaron, Aayan, Abby, Abdiel, Abdoul, Abdoulaye, Abdul, Abdullah, Abel, Abigail, Abraham, Ada, Adam, Adan, Addison, Adelaide, Adele, Adeline, Aden, Adina, Aditya, Adonis, Adrian, Adriana, Adrianna, Adriel, Aharon, Ahmad, Ahmed, Ahron, Ahuva, Aicha, Aidan, Aiden, Aileen, Aimee, Aisha, Aissatou, Aiza, Akiva, Alan, Alana, Alani, Albert, Alberto, Aldo, Alec, Aleena, Alejandra, Alejandro, Aleksander, Alessandra, Alessia, Alex, Alexa, Alexander, Alexandra, Alexandria, Alexia, Alexis, Alfredo, Ali, Alice, Alicia, Alijah, Alina, Alisa, Alisha, Alison, Alisson, Aliyah, Aliza, Allan, Allen, Allison, Allyson, Alma, Alondra, Alpha, Alston, Alter, Alvin, Alyson, Alyssa, Amadou, Amalia, Amanda, Amani, Amar'E, Amare, Amari, Amaya, Amber, Amberly, Amelia, Amelie, Amina, Aminata, Amir, Amira, Amirah, Amiyah, Amrom, Amy, Ana, Anais, Analia, Anastasia, Anaya, Anderson, Andre, Andrea, Andres, Andrew, Andy, Angel, Angela, Angelica, Angelina, Angelique, Angelo, Angely, Angie, Angus, Anika, An
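
The order of the chained methods matters. The following sketch runs the same chain on a handful of made-up names in mixed capitalization, and then shows what happens if duplicates are dropped before the case is normalized. Note that str.title() also capitalizes the letter after an apostrophe, which is why "Amar'E" appears in the output above.

```python
import pandas as pd

# Hypothetical names in mixed capitalization, with repeats.
names = pd.Series(["GIA", "Gia", "AARON", "aaron", "AMAR'E", "zoe"])

# Normalize the case, drop duplicates, sort, then join into one string.
result = ", ".join(names.str.title().drop_duplicates().sort_values())
print(result)

# Dropping duplicates first misses "GIA" vs "Gia", so repeats survive.
wrong_order = names.drop_duplicates().str.title().tolist()
print(wrong_order)
```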

Export CSV File with the to_csv Method

We will take a look at how we can export a DataFrame to a CSV file using the to_csv method. This is the reverse of the pd.read_csv() method we have been using. 

In [31]:
url = "https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv"
baby_names = pd.read_csv(url)
baby_names.head(6)

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,FEMALE,HISPANIC,GERALDINE,13,75
1,2011,FEMALE,HISPANIC,GIA,21,67
2,2011,FEMALE,HISPANIC,GIANNA,49,42
3,2011,FEMALE,HISPANIC,GISELLE,38,51
4,2011,FEMALE,HISPANIC,GRACE,36,53
5,2011,FEMALE,HISPANIC,GUADALUPE,26,62


In [32]:
# Export DataFrame using the to_csv() method:

baby_names.to_csv("NYC_Baby_Names.csv")

# The only argument we provided to the to_csv() method is the name of the file we want to save. 
# Note: pandas does not add the .csv extension for us, so include it in the name. It is also good practice to avoid spaces in file names. 
# There is no output, but pandas writes the DataFrame's content into a CSV file, saved in the same folder as this notebook. 

In [33]:
# Index parameter

baby_names.to_csv("NYC_Baby_Names.csv", index = False)
# By default index = True. If we don't want to include the index, then set index = False. 
# Pandas will overwrite the existing file with the same file name. 
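
One practical reason to set index = False: a CSV written with the default index gains an extra unlabeled column, which read_csv names "Unnamed: 0" when the file is read back. A minimal sketch, using in-memory buffers instead of files and a made-up two-row DataFrame:

```python
import io

import pandas as pd

df = pd.DataFrame({"Gender": ["FEMALE", "MALE"], "Count": [13, 15]})

# Write with the default index=True, then read the text back.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# Write with index=False: the round trip returns the original columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```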

In [36]:
# Customizing the columns we want to include.

# Example: Include only the Gender, Ethnicity and the Child's First Name column. 

baby_names.to_csv("NYC_Baby_Names.csv", index = False, columns = ["Gender", "Ethnicity", "Child's First Name"])

# By default, all columns will be exported. The columns parameter allows us to limit and specify the columns we want to export
# to the csv file. 

In [38]:
# Dealing with encoding errors: running into a character that pandas doesn't know how to write. 

# The encoding parameter is set to "utf-8", a widely used character encoding scheme that can represent every Unicode character,
# including English, Greek and accented Latin characters. It is also the default encoding for to_csv(). 

baby_names.to_csv("NYC_Baby_Names.csv", index = False, columns = ["Gender", "Ethnicity", "Child's First Name"], encoding = "utf-8")

# If an encoding error remains, search online (e.g. Google or Stack Exchange) for the encoding that handles that
# specific character and pass it instead of "utf-8". 
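
To see the columns and encoding parameters working together, the sketch below writes a made-up DataFrame containing accented characters to a temporary folder, inspects the raw bytes, and reads the file back. The sample rows, the file name and the temporary location are all assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows containing non-ASCII characters.
df = pd.DataFrame(
    {
        "Gender": ["MALE", "FEMALE"],
        "Ethnicity": ["HISPANIC", "WHITE NON HISPANIC"],
        "Child's First Name": ["José", "Zoë"],
        "Count": [10, 12],
    }
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "names_sample.csv")  # illustrative file name
    df.to_csv(path, index=False, columns=["Gender", "Child's First Name"], encoding="utf-8")

    # Inspect the raw bytes that were written, then read the file back.
    with open(path, "rb") as handle:
        raw = handle.read()
    restored = pd.read_csv(path, encoding="utf-8")

print("José".encode("utf-8") in raw)
print(restored.columns.tolist())
print(restored["Child's First Name"].tolist())
```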

First, we install the xlrd and openpyxl libraries, which pandas uses to read and write Excel files, using the Anaconda Prompt.  

Once they are installed, we will learn how to read Excel files into pandas DataFrames and how to write pandas DataFrames to Excel files. 
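
Installation itself happens outside Python, typically with `conda install xlrd openpyxl` in the Anaconda Prompt or `pip install xlrd openpyxl`. As a quick sanity check from inside a notebook, the following sketch reports whether each library can be found. It uses only the standard library, and the variable names are illustrative.

```python
from importlib.util import find_spec

# True if the package can be imported in the current environment, False otherwise.
available = {name: find_spec(name) is not None for name in ("xlrd", "openpyxl")}

for name, found in available.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```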

Import Excel File into pandas

In [4]:
# Single worksheet in the Excel file. 

df = pd.read_excel("Data - Single Worksheet.xlsx", engine = "openpyxl")
df

# Since there is only one worksheet in this Excel file, pandas targets that worksheet and brings it in as a DataFrame. 

Unnamed: 0,First Name,Last Name,City,Gender
0,Brandon,James,Miami,M
1,Sean,Hawkins,Denver,M
2,Judy,Day,Los Angeles,F
3,Ashley,Ruiz,San Francisco,F
4,Stephanie,Gomez,Portland,F


In [5]:
# Parameters of read_excel(): these are very similar to the parameters of read_csv; we are just dealing with a different file type. 
# index_col : allows us to specify which column(s) to use as the index. 
# usecols : allows us to specify which columns to include in the import. 
# squeeze : to import a single column as a Series rather than a DataFrame (available in the pandas version used here, 1.1.3; 
#           removed in pandas 2.0). 

In [22]:
# Workbook with multiple worksheets. 

# Importing specific worksheets: 
pd.read_excel("Data - Multiple Worksheets.xlsx", engine = "openpyxl")
# Without specification, pandas will target the first worksheet in the workbook.

# We use the sheet_name parameter (0 by default) to specify which worksheet to import. Each worksheet is
# assigned an index position: the first worksheet has index position 0, the second has index position 1, and so on. 
# Alternatively, we can provide the name of the worksheet rather than its index position. 

pd.read_excel("Data - Multiple Worksheets.xlsx", engine = "openpyxl", sheet_name = 1)
pd.read_excel("Data - Multiple Worksheets.xlsx", engine = "openpyxl", sheet_name = "Data 2")

# Both lines of code return the same result: the second worksheet (tab) in the file. 

Unnamed: 0,First Name,Last Name,City,Gender
0,Parker,Power,Raleigh,F
1,Preston,Prescott,Philadelphia,F
2,Ronaldo,Donaldo,Bangor,M
3,Megan,Stiller,San Francisco,M
4,Bustin,Jieber,Austin,F


In [32]:
# Importing multiple worksheets:
# Provide a list to the sheet_name parameter.

# pd.read_excel("Data - Multiple Worksheets.xlsx", engine = "openpyxl", sheet_name = [0, 1])
data = pd.read_excel("Data - Multiple Worksheets.xlsx", engine = "openpyxl", sheet_name = ["Data 1","Data 2"])
data
# Either form imports the same worksheets, "Data 1" (index position 0) and "Data 2" (index position 1). 
# Only the dictionary keys differ: index positions in the first form, worksheet names in the second. 

dict

In [37]:
# Because we are importing two worksheets, pandas stores them in a Python dictionary of DataFrames. 
# The keys of the dictionary match what we passed to sheet_name: the worksheet names "Data 1" and "Data 2" here, or the index 
# positions 0 and 1 if we had passed [0, 1]. The value for each key is the respective DataFrame. 

type(data)

# To access the DataFrame "Data 1" (use data[0] instead if index positions were passed to sheet_name when importing):
data["Data 1"]

# To access the DataFrame "Data 2":
data["Data 2"]

Unnamed: 0,First Name,Last Name,City,Gender
0,Parker,Power,Raleigh,F
1,Preston,Prescott,Philadelphia,F
2,Ronaldo,Donaldo,Bangor,M
3,Megan,Stiller,San Francisco,M
4,Bustin,Jieber,Austin,F


In [39]:
# Import all worksheets from workbook. 

pd.read_excel("Data - Multiple Worksheets.xlsx", engine = "openpyxl", sheet_name = None)
# Imports all the worksheets in the workbook (there are two in this case) and uses the names of the worksheets as the keys. 
# The values are the respective DataFrames storing the data from those worksheets. 

{'Data 1':   First Name Last Name           City Gender
 0    Brandon     James          Miami      M
 1       Sean   Hawkins         Denver      M
 2       Judy       Day    Los Angeles      F
 3     Ashley      Ruiz  San Francisco      F
 4  Stephanie     Gomez       Portland      F,
 'Data 2':   First Name Last Name           City Gender
 0     Parker     Power        Raleigh      F
 1    Preston  Prescott   Philadelphia      F
 2    Ronaldo   Donaldo         Bangor      M
 3      Megan   Stiller  San Francisco      M
 4     Bustin    Jieber         Austin      F}
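
A dictionary of DataFrames can be processed like any other Python dictionary. The sketch below builds one by hand as a stand-in for the result of read_excel(..., sheet_name = None), using a few values from the output above, so it runs without the Excel file. Combining the sheets with pd.concat is one possible follow-up step, shown here as an assumption rather than something the lesson prescribes.

```python
import pandas as pd

# Stand-in for the dictionary returned by read_excel(..., sheet_name=None).
sheets = {
    "Data 1": pd.DataFrame({"First Name": ["Brandon", "Sean"], "City": ["Miami", "Denver"]}),
    "Data 2": pd.DataFrame({"First Name": ["Parker", "Preston"], "City": ["Raleigh", "Philadelphia"]}),
}

# Loop over the worksheet names and their DataFrames.
for sheet_name, frame in sheets.items():
    print(sheet_name, frame.shape)

# Stack all worksheets into one DataFrame; the dictionary keys become the outer index level.
combined = pd.concat(sheets, names=["Sheet", "Row"])
data_2_cities = combined.loc["Data 2"]["City"].tolist()

print(len(combined))
print(data_2_cities)
```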

Export Excel File

In the previous lesson, we learnt how to import an Excel workbook into a pandas object such as a DataFrame or a dictionary of DataFrames. 

In this lesson, we will explore the process in reverse: we will take pandas objects and write them to an Excel .xlsx file. 

Steps for creating an Excel file and exporting pandas DataFrames to it:
- Step 1: Create an Excel writer object. This is a container that we configure to export one or more worksheets. 
- Step 2: Store the DataFrames in the Excel writer object and describe how we want each one written. 
- Step 3: Save using the .save() method. 

In [40]:
url = "https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv"
baby_names = pd.read_csv(url)
baby_names.head(6)

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,FEMALE,HISPANIC,GERALDINE,13,75
1,2011,FEMALE,HISPANIC,GIA,21,67
2,2011,FEMALE,HISPANIC,GIANNA,49,42
3,2011,FEMALE,HISPANIC,GISELLE,38,51
4,2011,FEMALE,HISPANIC,GRACE,36,53
5,2011,FEMALE,HISPANIC,GUADALUPE,26,62


In [50]:
# Create two DataFrames; we will write each one to a separate worksheet in the eventual Excel workbook.
# We will split by gender.

girls = baby_names[baby_names["Gender"] == "FEMALE"]
boys = baby_names[baby_names["Gender"] == "MALE"]
girls
boys

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
545,2011,MALE,ASIAN AND PACIFIC ISLANDER,AARAV,15,51
546,2011,MALE,ASIAN AND PACIFIC ISLANDER,AARON,51,19
547,2011,MALE,ASIAN AND PACIFIC ISLANDER,ABDUL,20,46
548,2011,MALE,ASIAN AND PACIFIC ISLANDER,ABDULLAH,30,36
549,2011,MALE,ASIAN AND PACIFIC ISLANDER,ADAM,28,38
...,...,...,...,...,...,...
29455,2016,MALE,WHITE NON HISPANIC,Cheskel,20,89
29456,2015,MALE,WHITE NON HISPANIC,Arthur,29,75
29458,2015,MALE,WHITE NON HISPANIC,Joel,32,72
29461,2015,MALE,WHITE NON HISPANIC,Mendel,42,64


In [54]:
# First step: Create an Excel writer object. This is a container that we configure to export one or more
# worksheets. 

# pd.ExcelWriter : this is a constructor. We create a new Excel writer in the same way that we create a new Series or a new
# DataFrame in pandas.

# We need to provide the file name for the .xlsx file we are about to create. Note that the .xlsx extension must be included. 

excel_file = pd.ExcelWriter("Baby_Names.xlsx")

# Executing this cell does not save the file yet. It creates a foundation (or base) upon which
# we build the workbook that we will eventually save. 

In [61]:
# Second step: Storing the DataFrames into the Excel Writer object and describing how we want it to work. 

# Exporting the DataFrames and defining (some key) parameters.

girls.to_excel(excel_writer = excel_file, sheet_name = "Girls", index = False)

# excel_writer : the Excel writer (workbook) we want to export the DataFrame to, which we have just created. 
# sheet_name : the name of the worksheet that will hold this DataFrame.
# index : whether to include the index in the Excel file; set to False if unwanted. 
# columns : the columns to include. If no argument is provided, all columns are included by default. 


In [62]:
boys.to_excel(excel_writer = excel_file, sheet_name = "Boys", index = False, columns = ["Year of Birth", "Gender", "Ethnicity"])

# Running the to_excel cells does not yet create the Excel file; the DataFrames are only stored in the Excel writer object. 

In [64]:
# Step 3: Save. 

# Once we have configured the Excel writer object, stored all the DataFrames in it and described how we 
# want them written (e.g. sheet names, columns, indexes), we call the .save() method on the Excel writer object. 

excel_file.save()

# Calling .save() is the point at which the Excel workbook file is actually created with our DataFrames,  
# and it will appear in the same folder as the notebook.  
# Note: .save() works in the pandas version used here (1.1.3). It was removed in pandas 2.0, where excel_file.close() does the same job. 
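
The three steps can also be written with the Excel writer as a context manager, which saves and closes the workbook automatically when the with block ends. This form works both in the pandas version used here and in newer releases where .save() is no longer available. A minimal sketch, assuming openpyxl is installed; the sample rows and the temporary file location are made up for illustration:

```python
import os
import tempfile

import pandas as pd

girls = pd.DataFrame({"Gender": ["FEMALE"], "Child's First Name": ["GIA"], "Count": [21]})
boys = pd.DataFrame({"Gender": ["MALE"], "Child's First Name": ["AARON"], "Count": [51]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "Baby_Names.xlsx")

    # Steps 1 to 3 in one block: the workbook is written when the block exits.
    with pd.ExcelWriter(path, engine="openpyxl") as excel_file:
        girls.to_excel(excel_file, sheet_name="Girls", index=False)
        boys.to_excel(excel_file, sheet_name="Boys", index=False, columns=["Gender", "Count"])

    # Read every worksheet back to confirm what was written.
    sheets = pd.read_excel(path, sheet_name=None, engine="openpyxl")

print(list(sheets))
print(sheets["Boys"].columns.tolist())
```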