**Note:**

This set of exercises is taken from the **[Learn Pandas](https://www.kaggle.com/learn/pandas)** series on Kaggle. It is distributed under the [Apache 2.0 license](http://www.apache.org/licenses/LICENSE-2.0). It is only lightly modified for use in this class.


# Introduction

The first step in most data analytics projects is reading the data file. In this section, you'll create `Series` and `DataFrame` objects, both by hand and by reading data files.

**(Grading: Two points each.)**

In [1]:
# import needed libraries
import pandas as pd
pd.set_option('display.max_rows', 5)  # full option name; the short 'max_rows' is ambiguous in recent pandas

# Exercises

**Exercise 1**: Create a `DataFrame` that looks like this:

![](https://i.imgur.com/Ax3pp2A.png)

In [2]:
expected = pd.DataFrame({'Apples': [30], 'Bananas': [21]})
expected

Unnamed: 0,Apples,Bananas
0,30,21


**Exercise 2**: Create the following `DataFrame`:

![](https://i.imgur.com/CHPn7ZF.png)

In [3]:
expected = pd.DataFrame(
    {'Apples': [35, 41], 'Bananas': [21, 34]},
    index=['2017 Sales', '2018 Sales'],
)
expected

Unnamed: 0,Apples,Bananas
2017 Sales,35,21
2018 Sales,41,34


**Exercise 3**: Create a `Series` that looks like this:

```
Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object
```

In [4]:
expected = pd.Series(['4 cups', '1 cup', '2 large', '1 can'],
                     index=['Flour', 'Milk', 'Eggs', 'Spam'],
                     name='Dinner')
expected

Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object

**Exercise 4**: Read the following `csv` dataset on wine reviews into a `DataFrame`:

![](https://i.imgur.com/74RCZtU.png)

The filename for the CSV file is `winemag-data_first150k.csv`. You can download it here: https://www.kaggle.com/zynicide/wine-reviews/downloads/winemag-data_first150k.csv/4

NOTE: When you commit your code to Git, do not commit the data files; they are too large. List them in `.gitignore` so that Git ignores them.

In [5]:
expected = pd.read_csv("winemag-data_first150k.csv", index_col=0)
expected.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
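
The effect of `index_col=0` in the solution above can be seen on a much smaller input. The following is a minimal sketch, assuming a toy two-row CSV (the `csv_text` string and the names `default_df` and `indexed_df` are illustrative, not part of the exercise) whose first column is an unnamed row index, like the wine reviews file:

```python
import io

import pandas as pd

# Toy CSV (illustrative values only) whose first column has no header,
# mimicking the layout of the wine reviews file.
csv_text = ",country,points\n0,US,96\n1,Spain,96\n"

# Default read: the unnamed first column becomes an ordinary column
# called "Unnamed: 0", and pandas adds its own integer index.
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))  # ['Unnamed: 0', 'country', 'points']
print(list(indexed_df.columns))  # ['country', 'points']
```

Without `index_col=0`, the loaded frame would carry an extra `Unnamed: 0` column and would not match the image in the exercise.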


**Exercise 5**: Read the following `xls` sheet into a `DataFrame`: 

![](https://i.imgur.com/QZJBIBF.png)

The filename for the XLS file is `WICAgencies2014ytd.xls`. You can download it here: https://www.kaggle.com/jpmiller/publicassistance/downloads/xls_files_all.zip/2

Hint: the name of the method you need includes the word `excel`. The name of the sheet is `Pregnant Women Participating`. Don't do any cleanup before or after the import. Just import the data to replicate the image above.

In [6]:
expected = pd.read_excel("./xls_files_all/WICAgencies2014ytd.xls",
                         sheet_name='Pregnant Women Participating')
expected.head()

Unnamed: 0,WIC PROGRAM -- NUMBER OF PREGNANT WOMEN PARTICIPATING,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,FISCAL YEAR 2014,,,,,,,,,,,,,
1,"Data as of January 05, 2018",,,,,,,,,,,,,
2,,,,,,,,,,,,,,
3,State Agency or Indian Tribal Organization,2013-10-01 00:00:00,2013-11-01 00:00:00,2013-12-01 00:00:00,2014-01-01 00:00:00,2014-02-01 00:00:00,2014-03-01 00:00:00,2014-04-01 00:00:00,2014-05-01 00:00:00,2014-06-01 00:00:00,2014-07-01 00:00:00,2014-08-01 00:00:00,2014-09-01 00:00:00,Average Participation
4,Connecticut,5847,5476,5274,5360,5056,5319,5500,5717,5703,5905,5754,5624,5544.58


**Exercise 6**: Suppose we have the following `DataFrame`:

In [7]:
q6_df = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])

Save this `DataFrame` to disk as a `csv` file with the name `cows_and_goats.csv`.

In [8]:
expected = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])

In [9]:
expected.to_csv('cows_and_goats.csv')

In [10]:
import os
expected = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
os.path.exists("cows_and_goats.csv") and pd.read_csv("cows_and_goats.csv", index_col=0).equals(expected)

True
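
The check above depends on reading the file back with `index_col=0`. The following is a rough sketch of the same round trip, assuming a temporary directory is acceptable so that no file is left behind (the names `df`, `with_index`, and `without_index` are illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]},
                  index=['Year 1', 'Year 2'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cows_and_goats.csv')

    # to_csv writes the row labels as the first, unnamed column.
    df.to_csv(path)

    # Reading with index_col=0 restores those labels as the index.
    with_index = pd.read_csv(path, index_col=0)

    # Reading without it leaves them in an extra "Unnamed: 0" column.
    without_index = pd.read_csv(path)

print(with_index.equals(df))          # True
print(list(without_index.columns))    # ['Unnamed: 0', 'Cows', 'Goats']
```

If the row labels were not wanted in the file at all, `df.to_csv(path, index=False)` would omit them, but the file would then no longer round-trip to the original `DataFrame`.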

**Exercise 7**: Read the following `SQL` data into a `DataFrame`:

![](https://i.imgur.com/mmvbOT3.png)

The filename is `database.sqlite`. You can download the data here: https://www.kaggle.com/nolanbconaway/pitchfork-data/downloads/database.sqlite/1

Hint: use the `sqlite3` library. The name of the table is `artists`.

In [11]:
import sqlite3
conn = sqlite3.connect("database.sqlite")
expected = pd.read_sql_query("SELECT * FROM artists", conn)
conn.close()  # release the database file once the data is loaded
expected.head()

Unnamed: 0,reviewid,artist
0,22703,massive attack
1,22721,krallice
2,22659,uranium club
3,22661,kleenex
4,22661,liliput
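
To try the same pattern without downloading `database.sqlite`, an in-memory SQLite database can stand in for the file. This is only a sketch: the `artists` table is recreated here by hand with two rows copied from the output above, and the column types are an assumption rather than the real schema.

```python
import sqlite3

import pandas as pd

# ":memory:" creates a throwaway database that lives only in RAM.
conn = sqlite3.connect(":memory:")
try:
    # Assumed schema: the real table's column types may differ.
    conn.execute("CREATE TABLE artists (reviewid INTEGER, artist TEXT)")
    conn.executemany(
        "INSERT INTO artists VALUES (?, ?)",
        [(22703, "massive attack"), (22721, "krallice")],
    )
    conn.commit()

    # Same call as in the exercise: run a query, get a DataFrame back.
    artists = pd.read_sql_query("SELECT * FROM artists", conn)
finally:
    # Close the connection even if the query fails.
    conn.close()

print(list(artists.columns))  # ['reviewid', 'artist']
print(len(artists))           # 2
```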
