# Let's practice reading in datasets

### What's an import?
To use a package in your code, you must first import it, because Python cannot use a name before it is defined. Some things are built in: the basic types (such as `int` and `float`) are always available. Most other work needs more than that. Importing a package (such as pandas) makes it accessible in the current scope. Because we installed Python through Anaconda, many of the most popular packages (such as `numpy` and `pandas`) are already installed, and we only need `import` to load them for this session.
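As a minimal sketch of that rule, the snippet below uses the standard-library `math` module purely as a stand-in for any package. It shows that built-in types work without an import, while a module's name only becomes usable once the `import` line has run.

```python
# Built-in types need no import.
whole = int("42")
fraction = float("3.5")

# In a fresh session, a module name is undefined until it is imported.
try:
    math.sqrt(16)
    worked_before_import = True
except NameError:
    worked_before_import = False

import math  # now `math` is defined in the current scope

root = math.sqrt(16)
print(whole, fraction, root)
print("usable before import?", worked_before_import)
```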

In [1]:
# Let's import the pandas package.
import pandas as pd

## Practice reading data files into pandas

In [2]:
# Try this out:
df=pd.read_csv('../data/ufo.csv')

In [3]:
# On a Windows machine, the same relative path can be written with backslashes in a raw string:
# df=pd.read_csv(r'..\data\ufo.csv')
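The `r` prefix matters because a backslash normally starts an escape sequence inside a Python string. The sketch below uses only the standard library and the Windows form of the relative path from `In [2]` (an illustrative path, not a file that must exist) to show what the raw string preserves.

```python
from pathlib import PureWindowsPath

raw = r'..\data\ufo.csv'        # raw string: backslashes are kept literally
escaped = '..\\data\\ufo.csv'   # same text with every backslash escaped by hand

# PureWindowsPath parses Windows-style paths on any operating system.
win_path = PureWindowsPath(raw)

same_text = raw == escaped
file_name = win_path.name
forward_slash_form = win_path.as_posix()

print(same_text, file_name, forward_slash_form)
```

Without the `r` prefix, the `\u` in this particular path would be read as the start of a Unicode escape and Python would refuse to parse the string.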

In [4]:
# Look at the first few lines of data:
df.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [5]:
# And the last few:
df.tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
80538,Neligh,,CIRCLE,NE,9/4/2014 23:20
80539,Uhrichsville,,LIGHT,OH,9/5/2014 1:14
80540,Tucson,RED BLUE,,AZ,9/5/2014 2:40
80541,Orland park,RED,LIGHT,IL,9/5/2014 3:43
80542,Loughman,,LIGHT,FL,9/5/2014 5:30


In [None]:
# Show a random selection of rows (sample() on its own returns just one row):
df.sample(5)

In [6]:
# How big is my dataset?
df.shape

(80543, 5)

In [7]:
# What are the columns?
df.columns

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')

In [8]:
# What type of object is this?
type(df)

pandas.core.frame.DataFrame

In [9]:
# Show me some high-level statistics about my dataframe.
results=df.describe()
results

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
count,80496,17034,72141,80543,80543
unique,13504,31,27,52,68901
top,Seattle,ORANGE,LIGHT,CA,7/4/2014 22:00
freq,646,5216,16332,10743,45
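Every column in the UFO data is text, which is why the summary above shows `count`, `unique`, `top` and `freq`. When a frame also has numeric columns, `describe()` reports only those by default. The sketch below uses a small made-up frame (not taken from `ufo.csv`) to illustrate the difference.

```python
import pandas as pd

# Made-up toy data, not taken from ufo.csv.
toy = pd.DataFrame({
    'city': ['Ithaca', 'Holyoke', 'Ithaca'],
    'sightings': [1, 2, 4],
})

numeric_summary = toy.describe()             # numeric columns only
full_summary = toy.describe(include='all')   # text and numeric columns together

print(numeric_summary)
print(full_summary)
```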


### Save your results as a new csv file

In [None]:
# This will save to your current working directory.
results.to_csv('my_results.csv')

In [None]:
# Read it back in, using the first column as the row labels.
pd.read_csv('my_results.csv', index_col=0)
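A small sketch of what happens to the row labels on a save-and-reload round trip. It writes to an in-memory buffer instead of a file, and the frame is made-up data standing in for `results`.

```python
import io
import pandas as pd

# Made-up stand-in for the `results` summary table.
summary = pd.DataFrame({'City': [3, 2], 'State': [3, 2]},
                       index=['count', 'unique'])

buffer = io.StringIO()
summary.to_csv(buffer)      # the row labels are written as the first column

buffer.seek(0)
naive = pd.read_csv(buffer)                 # labels come back as an ordinary column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # labels come back as the index

print(list(naive.columns))
print(list(restored.index))
```

If the row labels are not needed in the file at all, `to_csv(..., index=False)` leaves them out.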

## Now try this with a few other data files from the data folder.

In [11]:
# try baseball players:
df1=pd.read_csv('../data/hitters.csv')
df1.head()

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
0,293,66,1,30,29,14,1,293,66,1,30,29,14,A,E,446,33,20,,A
1,315,81,7,24,38,39,14,3449,835,69,321,414,375,N,W,632,43,10,475.0,N
2,479,130,18,66,72,76,3,1624,457,63,224,266,263,A,W,880,82,14,480.0,A
3,496,141,20,65,78,37,11,5628,1575,225,828,838,354,N,E,200,11,3,500.0,N
4,321,87,10,39,42,30,2,396,101,12,48,46,33,N,E,805,40,4,91.5,N


In [12]:
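# try movie ratings (a tab-separated file):
# Note: this file appears to have no header row, so pandas treats the first
# record (196, 242, 3, 881250949) as the column names in the output below.
# Passing header=None would keep that record as data.
df2=pd.read_csv('../data/movie_ratings.tsv', sep='\t')
df2.head()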

Unnamed: 0,196,242,3,881250949
0,186,302,3,891717742
1,22,377,1,878887116
2,244,51,2,880606923
3,166,346,1,886397596
4,298,474,4,884182806
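The column names in the output above look like data, which suggests the file has no header row. One way to handle that, sketched here on an in-memory copy of the first three records shown above, is to pass `header=None` along with your own labels. The labels used here are hypothetical; the file itself does not name its columns.

```python
import io
import pandas as pd

# The first three records visible in the output above, tab-separated, no header row.
raw_text = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

# Hypothetical column labels, chosen only for illustration.
column_names = ['user_id', 'item_id', 'rating', 'timestamp']

ratings = pd.read_csv(io.StringIO(raw_text), sep='\t',
                      header=None, names=column_names)
print(ratings)
```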


In [13]:
# try a json file:
df3=pd.read_json('../data/yelp.json', lines=True)
df3.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,"{'funny': 0, 'useful': 5, 'cool': 2}"
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,"{'funny': 0, 'useful': 0, 'cool': 0}"
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,"{'funny': 0, 'useful': 1, 'cool': 0}"
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,"{'funny': 0, 'useful': 2, 'cool': 1}"
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,"{'funny': 0, 'useful': 0, 'cool': 0}"
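The `lines=True` argument tells pandas that the file holds one JSON object per line instead of a single JSON document. A minimal sketch with two made-up records (not taken from `yelp.json`):

```python
import io
import pandas as pd

# Two made-up records, one JSON object per line.
json_lines = (
    '{"stars": 5, "text": "Great breakfast"}\n'
    '{"stars": 4, "text": "Good gyro plate"}\n'
)

# Each line becomes one row; the keys become the columns.
reviews = pd.read_json(io.StringIO(json_lines), lines=True)
print(reviews)
```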


In [14]:
# How about a data file that's hosted online?
url="https://raw.githubusercontent.com/austinlasseter/plotly_dash_tutorial/master/00%20resources/titanic.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked
0,0,3,male,22.0,7.25,Southampton
1,1,1,female,38.0,71.2833,Cherbourg
2,1,3,female,26.0,7.925,Southampton
3,1,1,female,35.0,53.1,Southampton
4,0,3,male,35.0,8.05,Southampton


### Using Python to find file locations on my machine

In [15]:
# Import a couple more packages for this.
import os
from pathlib import Path

In [16]:
# What is my current working directory?
home=Path.cwd()

In [17]:
# What's in that directory?
os.listdir(home)

['01-practice-reading-data.ipynb',
 '.ipynb_checkpoints',
 '02-intro-to-pandas.ipynb',
 '03-basic-eda.ipynb']
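`os.listdir` returns plain strings in arbitrary order. As an alternative sketch, `pathlib` can do the same listing and also filter by pattern. The example builds a throwaway folder with a few empty files (names borrowed from this lesson's data folder for illustration) so it runs anywhere.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Create a few empty files to list.
    for name in ['ufo.csv', 'yelp.json', 'hitters.csv']:
        (folder / name).touch()

    # iterdir() yields Path objects; sorting gives a stable order.
    all_names = sorted(p.name for p in folder.iterdir())
    # glob() keeps only the entries that match a pattern.
    csv_names = sorted(p.name for p in folder.glob('*.csv'))

print(all_names)
print(csv_names)
```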

In [18]:
# What's the parent directory?
home.parent

PosixPath('/Users/austinlasseter/atelier/generalassembly/datdc35/03-pandas-dataframes')

In [19]:
# What's the parent of that parent?
home.parent.parent

PosixPath('/Users/austinlasseter/atelier/generalassembly/datdc35')

In [21]:
# And what folders & files are in there?
os.listdir(home.parent.parent)

['03-pandas-dataframes',
 '.DS_Store',
 'course-info',
 '04-pandas-eda',
 '05-data-visualization',
 '02-command-line',
 '01-data-science-intro',
 '06-plotly-dash']

In [22]:
# Use this to create a path to your data.
path_to_data=home.parent / 'data'
print(path_to_data)

/Users/austinlasseter/atelier/generalassembly/datdc35/03-pandas-dataframes/data


In [23]:
# What files are inside that path?
os.listdir(path_to_data)

['old-faithful.csv',
 'collegeadmissions.csv',
 'u.item',
 'yelp.json',
 'msleep.csv',
 'beer.txt',
 '.DS_Store',
 'Production.ProductSubcategory.csv',
 'drinks.csv',
 'apply functions in pandas.ipynb',
 'imdb_1000.csv',
 'imdb_ids.txt',
 'oracle_10k.csv',
 'airlines.csv',
 'u.data',
 'ozone.csv',
 'vti.csv',
 'user.tbl',
 'ufo.csv',
 'u.user_original',
 'rossmann-stores.csv',
 'Sales.SalesOrderHeader.csv',
 'titanic.csv',
 'wine.csv',
 'student_comments.csv',
 'haystack.csv',
 'drones.csv',
 'movie_ratings.tsv',
 'mtcars.csv',
 'u.user',
 'bikeshare.csv',
 'hitters.csv',
 'features.csv',
 'NBA_players_2015.csv',
 'Sales.SalesOrderDetail.csv',
 'Production.Product.csv',
 'chipotle.tsv',
 'bank-additional.csv',
 'vehicles_train.csv',
 'vehicles_test.csv',
 'stores.csv']

In [24]:
# Build a filepath that works on any operating system this way.
filepath=home.parent / 'data' / 'bikeshare.csv'
print(filepath)

/Users/austinlasseter/atelier/generalassembly/datdc35/03-pandas-dataframes/data/bikeshare.csv
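A short sketch of two equivalent ways to join path parts with `pathlib`, plus a check that the file is really there before reading it. The temporary folder is a hypothetical stand-in for `home.parent`, so the example runs on any machine.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)   # hypothetical stand-in for home.parent
    (project / 'data').mkdir()
    (project / 'data' / 'bikeshare.csv').touch()

    # The / operator and joinpath() build the same path.
    with_slash = project / 'data' / 'bikeshare.csv'
    with_joinpath = project.joinpath('data', 'bikeshare.csv')
    same_path = with_slash == with_joinpath

    # exists() is a cheap check before handing the path to pandas.
    found = with_slash.exists()
    missing = (project / 'data' / 'no_such_file.csv').exists()

print(same_path, found, missing)
```

Either form uses the correct separator for the operating system the code runs on, which is what makes the path portable.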


In [25]:
# Read that filepath into pandas.
df=pd.read_csv(filepath)
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


## Practice

In [None]:
## Open a dataset and produce the following statistics:

# Use pathlib and os to generate the filepath.
# How many rows and columns?
# Show the first 5 rows of data
# List all the columns
# Describe the numeric columns
# Save your results as an Excel file.