# Lecture 1031: Reading in Excel files

In [1]:
import pandas as pd
import re

## pd.read_excel()

On October 31, 2022, I downloaded `export.xls` from [Cal-Access](https://cal-access.sos.ca.gov/Campaign/Committees/Detail.aspx?id=1414018&session=2021&view=received) using the download link labeled "DOWNLOAD THESE RESULTS: MICROSOFT EXCEL".

We're going to use the method [`pd.read_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) to import the data.

In [3]:
# you will get an error when running this
data = pd.read_excel('export.xls')

ValueError: Excel file format cannot be determined, you must specify an engine manually.

## Error 1
The error we got was "Excel file format cannot be determined, you must specify an engine manually." Let's go to the [method's documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) and search for `engine`.

```
engine: str, default None
If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility :
- “xlrd” supports old-style Excel files (.xls).
- “openpyxl” supports newer Excel file formats.
- “odf” supports OpenDocument file formats (.odf, .ods, .odt).
- “pyxlsb” supports Binary Excel files.
```

This is a .xls file, so we should use the `xlrd` engine.
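The rule of thumb from the documentation can be sketched as a small lookup. This is only an illustration: `guess_engine` and `ENGINE_BY_EXTENSION` are hypothetical names, not part of pandas, and the sketch assumes the file extension matches the file's real contents.

```python
from pathlib import Path

# Extension -> engine name, following the engine list quoted above.
# (Hypothetical helper for illustration; pandas does not ship this.)
ENGINE_BY_EXTENSION = {
    ".xls": "xlrd",
    ".xlsx": "openpyxl",
    ".ods": "odf",
    ".xlsb": "pyxlsb",
}


def guess_engine(filename):
    """Return the engine suggested by the file extension, or None if unknown."""
    return ENGINE_BY_EXTENSION.get(Path(filename).suffix.lower())


print(guess_engine("export.xls"))
```

An extension is only a label, so a lookup like this is a starting point rather than a guarantee.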

In [4]:
# you will STILL get an error when running this
data = pd.read_excel('export.xls', engine='xlrd')

ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.

## Error 2

The error I got here was "Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd."

So, let's `pip install xlrd`. 

In [None]:
!pip install xlrd
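Before retrying, it can help to confirm that the package is importable from the same environment the notebook kernel is using. A minimal sketch using only the standard library (`is_installed` is a name made up for this example):

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can find `module_name`."""
    return importlib.util.find_spec(module_name) is not None


# Check the two optional Excel dependencies used in this lecture.
for name in ("xlrd", "openpyxl"):
    print(name, "installed" if is_installed(name) else "missing")
```

If a package still shows as missing right after `pip install`, the install probably went to a different Python environment than the one running the notebook.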

In [5]:
# you will STILL get an error when running this
data = pd.read_excel('export.xls', engine='xlrd')

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'"NAME OF'

## Error 3
I still got an error! The error I got was: "Unsupported format, or corrupt file: Expected BOF record; found b'"NAME OF'"

OK, so does anyone know what the problem is?

If you have Excel on your computer, try opening up the file.

![Screenshot of error](xls_error.png "Screenshot of error")

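Something is wrong with this file. The error is a hint: `xlrd` expected a binary "BOF" (beginning of file) record but found the text `"NAME OF`, so the download is not a true `.xls` workbook despite its extension. When I got that alert, I hit "Yes" and then saved the file in Excel as a `.xlsx` file.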

(Note: I do not have Excel on my personal laptop, so I did this on my work laptop.)
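If you don't have Excel handy, you can get a similar hint by looking at the first few bytes of the file. The sketch below assumes the two most common signatures: the OLE2 header used by typical legacy `.xls` files and the ZIP header used by `.xlsx` files. `describe_header` is a hypothetical helper, and the check is a rough heuristic, not a full format validator.

```python
# Signatures of the two common Excel containers.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # typical legacy .xls
ZIP_MAGIC = b"PK\x03\x04"  # .xlsx files are ZIP archives


def describe_header(head):
    """Classify the leading bytes of a file (rough heuristic only)."""
    if head.startswith(OLE2_MAGIC):
        return "OLE2 container (typical .xls)"
    if head.startswith(ZIP_MAGIC):
        return "ZIP container (typical .xlsx)"
    return "neither; possibly plain text"


# The bytes reported in the XLRDError message.
print(describe_header(b'"NAME OF'))

# To check a real file (path assumed to exist in your working directory):
# with open("export.xls", "rb") as f:
#     print(describe_header(f.read(8)))
```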

## Try importing again

In [7]:
# we'll change the engine since `xlrd` doesn't support .xlsx files
data = pd.read_excel('export.xlsx', engine='openpyxl')

ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl.

ARGH: "ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl."

In [8]:
!pip install openpyxl

Collecting openpyxl
  Using cached openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Using cached et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.10

[notice] A new release of pip available: 22.2.2 -> 22.3
[notice] To update, run: pip install --upgrade pip


In [10]:
data = pd.read_excel('export.xlsx', engine='openpyxl')

Yay, that worked! Let's see what's inside and be sure to export a CSV of this!!!

## Explore and export

In [11]:
data

Unnamed: 0,NAME OF CONTRIBUTOR,PAYMENT TYPE,CITY,STATE,ZIP,ID NUMBER,EMPLOYER,OCCUPATION,AMOUNT,TRANSACTION DATE,FILED DATE,TRANSACTION NUMBER
0,CALIFORNIA CHIROPRACTIC ASSOCIATION PAC,MONETARY,SACRAMENTO,CA,95814,742986.0,,,-15000.0,2021-09-07,2022-05-20,2625145 - EXP2027
1,MARINA BILAVER,MONETARY,HOLLYWOOD,CA,90028,,,NOT EMPLOYED,10.0,2021-07-01,2022-05-20,2625145 - IDT41416
2,CAMERON BLOOMER,MONETARY,NEW YORK,NY,10010,,BLOOMER BIOTECH,INVESTMENT ADVISOR,100.0,2021-07-01,2022-05-20,2625145 - IDT41417
3,ALISON FLEMMING,MONETARY,LARKSPUR,CA,94904,,COOPER & MCCLOSKEY,INSURANCE BROKER,150.0,2021-07-01,2022-05-20,2625145 - IDT41424
4,WILLIAM FOWKES,MONETARY,LA HONDA,CA,94020,,,NOT EMPLOYED,25.0,2021-07-01,2022-05-20,2625145 - IDT41425
...,...,...,...,...,...,...,...,...,...,...,...,...
23381,"1-800 CONTACTS, INC.",MONETARY,DRAPER,UT,84020,,,,10000.0,2021-12-22,2022-06-06,2644331 - INC3631
23382,"T-MOBILE USA, INC.",MONETARY,BELLEVUE,WA,98006,,,,5000.0,2021-12-22,2022-06-06,2644331 - INC3632
23383,"GOOGLE, LLC AND AFFILIATED ENTITIES",MONETARY,MOUNTAIN VIEW,CA,94043,,,,32400.0,2021-12-27,2022-06-06,2644331 - INC3634
23384,AIRLINES FOR AMERICA,MONETARY,WASHINGTON,DC,20004,,,,10000.0,2021-12-28,2022-06-06,2644331 - INC3642


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23386 entries, 0 to 23385
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   NAME OF CONTRIBUTOR  23386 non-null  object        
 1   PAYMENT TYPE         23386 non-null  object        
 2   CITY                 23383 non-null  object        
 3   STATE                23368 non-null  object        
 4   ZIP                  23376 non-null  object        
 5   ID NUMBER            149 non-null    float64       
 6   EMPLOYER             10234 non-null  object        
 7   OCCUPATION           23017 non-null  object        
 8   AMOUNT               23386 non-null  float64       
 9   TRANSACTION DATE     23386 non-null  datetime64[ns]
 10  FILED DATE           23386 non-null  datetime64[ns]
 11  TRANSACTION NUMBER   23386 non-null  object        
dtypes: datetime64[ns](2), float64(2), object(8)
memory usage: 2.1+ MB


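POLL: Anything you're noticing about the data?

One thing that stands out: `ID NUMBER` was read as `float64` (for example `742986.0`), even though it is an identifier rather than a quantity. Only 149 of the 23,386 rows have a value, and pandas falls back to floats when a numeric column has missing values. Let's re-import with that column as `object`.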

In [14]:
data = pd.read_excel(
    'export.xlsx',
    engine='openpyxl',
    dtype={
        'ID NUMBER' : object
    }
)
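To see the effect of `dtype` in isolation, here is a small sketch on made-up rows. It uses `pd.read_csv` on an in-memory string as a stand-in for `pd.read_excel`, so that it runs without any Excel file; the `dtype` argument is passed the same way.

```python
import io

import pandas as pd

# Two made-up rows; the second has no ID, like most rows in the export.
raw = "NAME,ID NUMBER\nSOME PAC,742986\nSOME PERSON,\n"

# Default parsing: the missing value forces the column to float64.
as_float = pd.read_csv(io.StringIO(raw))

# Asking for `object` keeps the ID from being converted to a float.
as_object = pd.read_csv(io.StringIO(raw), dtype={"ID NUMBER": object})

print(as_float["ID NUMBER"].tolist())
print(as_object["ID NUMBER"].tolist())
```

With `read_csv` the kept values are strings; the point is only that no float conversion happens.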

In [15]:
data

Unnamed: 0,NAME OF CONTRIBUTOR,PAYMENT TYPE,CITY,STATE,ZIP,ID NUMBER,EMPLOYER,OCCUPATION,AMOUNT,TRANSACTION DATE,FILED DATE,TRANSACTION NUMBER
0,CALIFORNIA CHIROPRACTIC ASSOCIATION PAC,MONETARY,SACRAMENTO,CA,95814,742986,,,-15000.0,2021-09-07,2022-05-20,2625145 - EXP2027
1,MARINA BILAVER,MONETARY,HOLLYWOOD,CA,90028,,,NOT EMPLOYED,10.0,2021-07-01,2022-05-20,2625145 - IDT41416
2,CAMERON BLOOMER,MONETARY,NEW YORK,NY,10010,,BLOOMER BIOTECH,INVESTMENT ADVISOR,100.0,2021-07-01,2022-05-20,2625145 - IDT41417
3,ALISON FLEMMING,MONETARY,LARKSPUR,CA,94904,,COOPER & MCCLOSKEY,INSURANCE BROKER,150.0,2021-07-01,2022-05-20,2625145 - IDT41424
4,WILLIAM FOWKES,MONETARY,LA HONDA,CA,94020,,,NOT EMPLOYED,25.0,2021-07-01,2022-05-20,2625145 - IDT41425
...,...,...,...,...,...,...,...,...,...,...,...,...
23381,"1-800 CONTACTS, INC.",MONETARY,DRAPER,UT,84020,,,,10000.0,2021-12-22,2022-06-06,2644331 - INC3631
23382,"T-MOBILE USA, INC.",MONETARY,BELLEVUE,WA,98006,,,,5000.0,2021-12-22,2022-06-06,2644331 - INC3632
23383,"GOOGLE, LLC AND AFFILIATED ENTITIES",MONETARY,MOUNTAIN VIEW,CA,94043,,,,32400.0,2021-12-27,2022-06-06,2644331 - INC3634
23384,AIRLINES FOR AMERICA,MONETARY,WASHINGTON,DC,20004,,,,10000.0,2021-12-28,2022-06-06,2644331 - INC3642


### Export

In [16]:
data.to_csv('newsom_contribs.csv', index=False)
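One caveat worth sketching: a CSV does not store column types, so anyone reading the exported file back in has to specify them again. The rows below are made up for illustration (they are not from the Cal-Access export), and the CSV is written to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Made-up rows; note the ZIP code with a leading zero.
df = pd.DataFrame({"NAME": ["A DONOR", "B DONOR"], "ZIP": ["02139", "95814"]})

# Same kind of call as the export, but into a buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Reading back without a dtype lets pandas guess, which turns ZIP into integers.
default = pd.read_csv(io.StringIO(csv_text))

# Passing dtype keeps ZIP as text, leading zero included.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"ZIP": str})

print(default["ZIP"].tolist())
print(as_text["ZIP"].tolist())
```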