### Pandas - Creating, Loading, and Selecting a DataFrame.
Pandas is an open-source library that provides easy-to-use data structures and data analysis tools for the Python programming language. This notebook covers:
- Creating a table from scratch.
- Loading data from another file.
- Selecting certain rows or columns of a table.

In [1]:
# Import the pandas module:
# pandas is conventionally imported at the top of a Python file under the alias pd.
import pandas as pd

In [2]:
# Create a DataFrame: pass a dictionary to pd.DataFrame().
# Each key is a column name and each value is a list of column values.
# There are many ways to create a DataFrame; this is one of them.

raw_data = {
    'first_name': ['Chris', 'Jason', 'Ethane', 'Julia'],
    'last_name': ['Ally', 'Sobsun', 'Pasero', 'Ducruiz'],
    'age': [22, 45, 18, 37],
    'scoring': [3, 4, 5, 1]
}

pd.DataFrame(raw_data)

Unnamed: 0,age,first_name,last_name,scoring
0,22,Chris,Ally,3
1,45,Jason,Sobsun,4
2,18,Ethane,Pasero,5
3,37,Julia,Ducruiz,1
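
The output above lists the columns alphabetically, which is how older pandas versions treated dictionaries; recent versions are expected to keep the dictionary's insertion order instead. A minimal sketch of making the order explicit with the `columns` keyword (the name `df_ordered` is only illustrative):

```python
import pandas as pd

raw_data = {
    'first_name': ['Chris', 'Jason', 'Ethane', 'Julia'],
    'last_name': ['Ally', 'Sobsun', 'Pasero', 'Ducruiz'],
    'age': [22, 45, 18, 37],
    'scoring': [3, 4, 5, 1],
}

# columns= selects the dictionary keys to use and fixes their order.
df_ordered = pd.DataFrame(
    raw_data,
    columns=['first_name', 'last_name', 'age', 'scoring'],
)
print(list(df_ordered.columns))
```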


In [3]:
# Create a DataFrame: pass a list of lists to pd.DataFrame().
# Each inner list represents a row of data. Use the keyword argument columns to pass a list of column names.
# Note: in this example we control the ordering of the columns, because both the rows and the column names are lists.

df = pd.DataFrame([
        ['Chris', 'Ally', 22, 3],
        ['Jason', 'Sobsun', 45, 4],
        ['Ethane', 'Pasero', 18, 5],
        ['Julia', 'Ducruiz', 37, 1]],
        
        columns = ['first_name', 'last_name', 'age', 'scoring']
)
df

Unnamed: 0,first_name,last_name,age,scoring
0,Chris,Ally,22,3
1,Jason,Sobsun,45,4
2,Ethane,Pasero,18,5
3,Julia,Ducruiz,37,1


In [4]:
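# Load a CSV file with pd.read_csv():
# By default, read_csv() expects comma-separated values.
df_checklist = pd.read_csv('files/users.csv')
# In this file the fields are separated by semicolons, so we pass sep=';' to split the columns correctly.
df_checklist = pd.read_csv('files/users.csv', sep=';')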
df_checklist

Unnamed: 0,Name,Username,Email
0,Roger Smith,rsmith,wigginsryan@yahoo.com
1,Michelle Beck,mlbeck,hcosta@hotmail.com
2,Ashley Barker,a_bark_x,a_bark_x@turner.com
3,Lynn Gonzales,goodmanjames,lynniegonz@hotmail.com
4,Jennifer Chase,chasej,jchase@ramirez.com
5,Charles Hoover,choover,choover89@yahoo.com
6,Adrian Evans,adevans,adevans98@yahoo.com
7,Susan Walter,susan82,swilliams@yahoo.com
8,Stephanie King,stephanieking,sking@morris-tyler.com
9,Erika Miller,jessica32,ejmiller79@yahoo.com
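
To see what the `sep` argument changes, here is a small sketch that parses the same text twice. The text is a hypothetical stand-in for a semicolon-separated file, read from memory with `io.StringIO` so the example does not depend on `files/users.csv`:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for a file on disk.
text = "Name;Username\nRoger Smith;rsmith\nMichelle Beck;mlbeck\n"

# Default separator is a comma, so each line is read as a single field.
default_sep = pd.read_csv(io.StringIO(text))

# With sep=';' the fields are split into separate columns.
semicolon_sep = pd.read_csv(io.StringIO(text), sep=';')

print(default_sep.shape)
print(semicolon_sep.shape)
```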


In [5]:
# Load a text file the same way:
df_users = pd.read_csv('files/welcome_user.txt')
df_users

Unnamed: 0,#########################
0,Python is cool!!!
1,#########################


In [6]:
# Load an Excel file in a similar way, with pd.read_excel():
# Dataset: https://catalog.data.gov/dataset/time-to-hire
# This dataset represents the time taken to hire a GSA employee, from the internal request to hire
# through the entry on duty of the selected individual.
gsa_time_hire = pd.read_excel('files/time-to-hire-data-file.xlsx')
gsa_time_hire.head(3)

Unnamed: 0,DEPT_DESC,DEPT_SHORT_DESC,VACANCY,APPLICATION_COUNT,HIRE_COUNT,RECEIVED_DATE,APPROVED_DATE,OPEN_DATE,CLOSE_DATE,ISSUE_DATE,REFERRAL_RETURNED,POS_OFFERED,VAC_LOCATION,HIRED_DATE,SERIES
0,(7P) Office of the Assist. Regional Administra...,7,471551,38,1,2004-08-18,2004-08-19,2004-08-19,2004-09-17,2004-10-01,2004-11-05,2004-11-05,"Fort Worth, TX",2004-11-14,0301AU
1,(7P) Office of the Assist. Regional Administra...,7,471811,9,1,2004-09-17,2004-09-20,2004-09-20,2004-10-04,2004-10-13,2004-11-05,2004-11-05,"Fort Worth, TX",2004-11-14,0343B
2,(7P) Office of the Assist. Regional Administra...,7,471811,9,1,2004-09-17,2004-09-20,2004-09-20,2004-10-04,2004-10-13,2004-11-05,2004-11-05,"Fort Worth, TX",2004-11-14,0343B


In [7]:
# Save CSV file:
# Save it with .to_csv
# We use the DataFrame df (created above).
df.to_csv('files/csv_file.csv')
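
By default `.to_csv()` also writes the row index as an unnamed first column. The sketch below, which uses a small illustrative DataFrame and in-memory text instead of a file, shows the effect of `index=False` when the data is read back:

```python
import io
import pandas as pd

df_small = pd.DataFrame({'first_name': ['Chris', 'Jason'], 'age': [22, 45]})

# Without a path, to_csv() returns the CSV text instead of writing a file.
with_index = df_small.to_csv()
without_index = df_small.to_csv(index=False)

# Reading the text back shows the extra column created by the saved index.
reloaded = pd.read_csv(io.StringIO(with_index))
reloaded_clean = pd.read_csv(io.StringIO(without_index))

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```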

In [8]:
# Inspect a DataFrame:
# .head() method - get the first 5 rows of a DataFrame.
# .tail() method - get the last 5 rows of a DataFrame.
# .info() method - get a summary of each column (non-null count and data type).
df_checklist.head()

Unnamed: 0,Name,Username,Email
0,Roger Smith,rsmith,wigginsryan@yahoo.com
1,Michelle Beck,mlbeck,hcosta@hotmail.com
2,Ashley Barker,a_bark_x,a_bark_x@turner.com
3,Lynn Gonzales,goodmanjames,lynniegonz@hotmail.com
4,Jennifer Chase,chasej,jchase@ramirez.com


In [9]:
df_checklist.tail()

Unnamed: 0,Name,Username,Email
5,Charles Hoover,choover,choover89@yahoo.com
6,Adrian Evans,adevans,adevans98@yahoo.com
7,Susan Walter,susan82,swilliams@yahoo.com
8,Stephanie King,stephanieking,sking@morris-tyler.com
9,Erika Miller,jessica32,ejmiller79@yahoo.com


In [10]:
# Example of the .info() method:
df_checklist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
Name        10 non-null object
Username    10 non-null object
Email       10 non-null object
dtypes: object(3)
memory usage: 320.0+ bytes


In [11]:
# head(n) and tail(n) use n = 5 by default,
# but n can be any integer.
df_checklist.head(8)

Unnamed: 0,Name,Username,Email
0,Roger Smith,rsmith,wigginsryan@yahoo.com
1,Michelle Beck,mlbeck,hcosta@hotmail.com
2,Ashley Barker,a_bark_x,a_bark_x@turner.com
3,Lynn Gonzales,goodmanjames,lynniegonz@hotmail.com
4,Jennifer Chase,chasej,jchase@ramirez.com
5,Charles Hoover,choover,choover89@yahoo.com
6,Adrian Evans,adevans,adevans98@yahoo.com
7,Susan Walter,susan82,swilliams@yahoo.com
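
A short sketch of how `n` behaves at the edges, using an illustrative ten-row DataFrame named `numbers`: a value larger than the table returns every row, and a negative value excludes rows from the opposite end.

```python
import pandas as pd

numbers = pd.DataFrame({'n': range(10)})

all_rows = numbers.head(20)      # n larger than the table: all 10 rows
without_last = numbers.head(-2)  # negative n: everything except the last 2 rows
without_first = numbers.tail(-2) # negative n: everything except the first 2 rows

print(len(all_rows), len(without_last), len(without_first))
```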


In [12]:
# Select a column (a Series object, i.e. a one-dimensional labeled vector):
# After creating or loading a DataFrame, we can select part of its data.
# There are two equivalent notations:
df.first_name
# or df['first_name']

0     Chris
1     Jason
2    Ethane
3     Julia
Name: first_name, dtype: object
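
The two notations are not always interchangeable. In the sketch below, the column names are hypothetical and chosen to show where dot notation stops working: a name that clashes with a DataFrame method, and a name containing a space.

```python
import pandas as pd

# Hypothetical column names chosen to show the limits of dot notation.
df_cols = pd.DataFrame({
    'count': [1, 2, 3],
    'first name': ['Chris', 'Jason', 'Julia'],
})

# df_cols.count refers to the DataFrame.count method, not to the column.
dot_is_method = callable(df_cols.count)

# Bracket notation always returns the column.
count_values = df_cols['count'].tolist()
name_values = df_cols['first name'].tolist()

print(dot_is_method, count_values, name_values)
```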

In [13]:
# Select Multiple Columns:
# Pass a list of column names inside the indexing brackets (hence the double brackets).
df[['first_name', 'last_name']]

Unnamed: 0,first_name,last_name
0,Chris,Ally
1,Jason,Sobsun
2,Ethane,Pasero
3,Julia,Ducruiz


In [14]:
# Check it out!
# With type() we can see the difference between only one column and many columns.
print(type(df[['first_name', 'last_name']])) #DataFrame
print(type(df.first_name)) #Series

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
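
The type depends on the brackets rather than on the number of columns. A minimal sketch with an illustrative two-row DataFrame: a one-element list still yields a DataFrame.

```python
import pandas as pd

df_people = pd.DataFrame({'first_name': ['Chris', 'Jason'], 'age': [22, 45]})

one_col_series = df_people['first_name']    # single brackets: Series
one_col_frame = df_people[['first_name']]   # list with one name: DataFrame

print(type(one_col_series).__name__, one_col_series.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
```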


In [15]:
# Select Rows: df.iloc[] (pandas integer-position indexer)
# Remember that positions are zero-based: the first row is accessed using df.iloc[0].
print(df.iloc[2])

# Observation: a single row is returned as a pandas Series object.
print(type(df.iloc[2]))

first_name    Ethane
last_name     Pasero
age               18
scoring            5
Name: 2, dtype: object
<class 'pandas.core.series.Series'>


In [16]:
# Select Multiple Rows: there are many ways. Indexing just the rows.
# With a list of integers:
print(df.iloc[[2,3]])

# With a slice object:
print(df.iloc[2:])

# With a boolean mask the same length as the index:
print(df.iloc[[False, False, True, True]])

  first_name last_name  age  scoring
2     Ethane    Pasero   18        5
3      Julia   Ducruiz   37        1
  first_name last_name  age  scoring
2     Ethane    Pasero   18        5
3      Julia   Ducruiz   37        1
  first_name last_name  age  scoring
2     Ethane    Pasero   18        5
3      Julia   Ducruiz   37        1
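
Slices passed to `.iloc` follow the usual Python rules: the end position is excluded and negative positions count from the end. A small sketch on an illustrative copy of the data (`df_rows` is not part of the notebook):

```python
import pandas as pd

df_rows = pd.DataFrame({
    'first_name': ['Chris', 'Jason', 'Ethane', 'Julia'],
    'age': [22, 45, 18, 37],
})

middle = df_rows.iloc[1:3]   # positions 1 and 2; position 3 is excluded
last = df_rows.iloc[-1]      # last row, returned as a Series

print(middle)
print(last)
```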


In [17]:
# Select rows and columns: there are many ways. Indexing both axes.
# You can mix the indexer types for the index and columns. Use : to select the entire axis.
print(df)
print("####################################")
# With scalar integers:
print(df.iloc[0, 1])
print("----------------")
# With lists of integers:
print(df.iloc[[0, 1], [1, 2]])
print("----------------")
# With slice objects:
print(df.iloc[0, :])
print("----------------")
# With a boolean array whose length matches the number of columns (four here):
print(df.iloc[:, [True, False, True, False]])

  first_name last_name  age  scoring
0      Chris      Ally   22        3
1      Jason    Sobsun   45        4
2     Ethane    Pasero   18        5
3      Julia   Ducruiz   37        1
####################################
Ally
----------------
  last_name  age
0      Ally   22
1    Sobsun   45
----------------
first_name    Chris
last_name      Ally
age              22
scoring           3
Name: 0, dtype: object
----------------
  first_name  age
0      Chris   22
1      Jason   45
2     Ethane   18
3      Julia   37


In [18]:
# Select Rows with a condition: df[df.MyColumnName == column_value]
print(df[df.age > 40])
print("----------------")
print(df[df.last_name != "Sobsun"])
print("----------------")
print(df[df.first_name == "Ethane"])

  first_name last_name  age  scoring
1      Jason    Sobsun   45        4
----------------
  first_name last_name  age  scoring
0      Chris      Ally   22        3
2     Ethane    Pasero   18        5
3      Julia   Ducruiz   37        1
----------------
  first_name last_name  age  scoring
2     Ethane    Pasero   18        5


In [19]:
# Select Rows with multiple conditions: combine them with | (or) and & (and), each condition in parentheses.
print(df[(df.age < 40) | (df.last_name != "Sobsun")])
print("----------------")
print(df[(df.age > 20) & (df.last_name != "Sobsun")])

  first_name last_name  age  scoring
0      Chris      Ally   22        3
2     Ethane    Pasero   18        5
3      Julia   Ducruiz   37        1
----------------
  first_name last_name  age  scoring
0      Chris      Ally   22        3
3      Julia   Ducruiz   37        1
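
The parentheses and the `&` / `|` operators matter: Python's `and` / `or` try to reduce a whole Series to a single truth value, which pandas rejects. A sketch on an illustrative copy of the data, with the masks given names for readability:

```python
import pandas as pd

df_people = pd.DataFrame({
    'first_name': ['Chris', 'Jason', 'Ethane', 'Julia'],
    'last_name': ['Ally', 'Sobsun', 'Pasero', 'Ducruiz'],
    'age': [22, 45, 18, 37],
})

# Naming each mask keeps the combined filter readable.
older_than_20 = df_people.age > 20
not_sobsun = df_people.last_name != 'Sobsun'
both = df_people[older_than_20 & not_sobsun]
print(both)

# Using the keyword `and` on two Series raises a ValueError.
raised = False
try:
    df_people[older_than_20 and not_sobsun]
except ValueError as err:
    raised = True
    print('ValueError:', err)
```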


In [20]:
# Select Rows with the boolean method Series.isin(values):
# It returns a boolean Series showing whether each element in the Series matches
# an element in the passed sequence of values exactly.
print(df.first_name.isin(['Chris', 'Julia']))
print("----------------")

# Of course, we can use the boolean Series to select the matching rows:
print(df[df.first_name.isin(['Chris', 'Julia'])])

0     True
1    False
2    False
3     True
Name: first_name, dtype: bool
----------------
  first_name last_name  age  scoring
0      Chris      Ally   22        3
3      Julia   Ducruiz   37        1
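
The mask returned by `.isin()` can be inverted with the `~` operator to keep the rows that are not in the list. A minimal sketch on an illustrative copy of the data:

```python
import pandas as pd

df_people = pd.DataFrame({
    'first_name': ['Chris', 'Jason', 'Ethane', 'Julia'],
    'age': [22, 45, 18, 37],
})

in_list = df_people.first_name.isin(['Chris', 'Julia'])

# ~ flips every True/False in the mask, selecting everyone else.
others = df_people[~in_list]
print(others)
```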


In [21]:
# Reset indices with the .reset_index() method:
# We get a new DataFrame with a fresh zero-based index; the old index is kept as a column named 'index'.
df_new = df[(df.age < 40) | (df.last_name != "Sobsun")].reset_index()
df_new

Unnamed: 0,index,first_name,last_name,age,scoring
0,0,Chris,Ally,22,3
1,2,Ethane,Pasero,18,5
2,3,Julia,Ducruiz,37,1


In [22]:
# .reset_index() with the keyword drop=True discards the old index instead of keeping it as a column.
# Check out the index difference!
df_new = df[(df.age < 40) | (df.last_name != "Sobsun")].reset_index(drop=True)
df_new

Unnamed: 0,first_name,last_name,age,scoring
0,Chris,Ally,22,3
1,Ethane,Pasero,18,5
2,Julia,Ducruiz,37,1
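
Filtering keeps the original index labels, so labels and positions can drift apart. The sketch below, on an illustrative copy of the data, compares the index before and after `.reset_index(drop=True)`:

```python
import pandas as pd

df_people = pd.DataFrame({
    'first_name': ['Chris', 'Jason', 'Ethane', 'Julia'],
    'age': [22, 45, 18, 37],
})

# The filter drops row 1 but keeps the labels of the remaining rows.
filtered = df_people[df_people.age < 40]
labels_before = filtered.index.tolist()

# .iloc still works by position: position 1 is the row labeled 2.
second_by_position = filtered.iloc[1].first_name

renumbered = filtered.reset_index(drop=True)
labels_after = renumbered.index.tolist()

print(labels_before, second_by_position, labels_after)
```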
