# Pandas Data Functions

One of the most appealing aspects of Pandas is how simple it makes importing, exporting, and exploring data.

#### Import Dependencies

In [1]:
import pandas as pd
import os

#### Save path to data set in a variable

In [2]:
data_file = os.path.join('..', 'Resources', 'dataSet.csv')

#### Read csv file 

Pandas makes it embarrassingly easy to read csv files with its `read_csv()` function.

`head()` is a method called on a DataFrame that returns the top 5 rows of data. More (or fewer) rows can be returned by passing the number of rows you'd like as an argument.

In [3]:
data_file_pd = pd.read_csv(data_file)
data_file_pd.head()

| | id | First Name | Last Name | Gender | Amount |
|---|---|---|---|---|---|
| 0 | 1 | Todd | Lopez | M | 8067.7 |
| 1 | 2 | Joshua | White | M | 7330.1 |
| 2 | 3 | Mary | Lewis | F | 16335.0 |
| 3 | 4 | Emily | Burns | F | 12460.8 |
| 4 | 5 | Christina | Romero | F | 15271.9 |
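
As a minimal sketch of the two calls above, the snippet below reads CSV text and asks `head()` for a specific number of rows. The in-memory `csv_text` string is a hypothetical stand-in for `dataSet.csv`, holding only the five rows shown in the output; `sample_df` and `top_two` are illustrative names that do not appear in the notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for dataSet.csv: only the five rows shown above.
csv_text = """id,First Name,Last Name,Gender,Amount
1,Todd,Lopez,M,8067.7
2,Joshua,White,M,7330.1
3,Mary,Lewis,F,16335.0
4,Emily,Burns,F,12460.8
5,Christina,Romero,F,15271.9
"""

# read_csv accepts a path or any file-like object, such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

# Passing a number to head() limits the result to that many rows.
top_two = sample_df.head(2)
print(top_two)
```

Calling `sample_df.head()` with no argument would return up to five rows, which matches the default behavior described above.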


#### Display the datatype of each column

Notice that this line of code doesn't end with `()`. This is because `dtypes` is a property/attribute of a DataFrame, as opposed to a function that is being executed against the DataFrame.

In [4]:
data_file_pd.dtypes

id              int64
First Name     object
Last Name      object
Gender         object
Amount        float64
dtype: object
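
To illustrate the attribute-versus-method distinction, here is a small sketch using a hypothetical three-row DataFrame (`sample_df` is not part of the notebook). Attributes such as `dtypes`, `shape`, and `columns` are read without parentheses, while methods such as `head()` need them.

```python
import pandas as pd

# Hypothetical miniature DataFrame, for illustration only.
sample_df = pd.DataFrame({
    "id": [1, 2, 3],
    "Gender": ["M", "M", "F"],
    "Amount": [8067.7, 7330.1, 16335.0],
})

# Attributes: plain lookups, no parentheses.
column_types = sample_df.dtypes      # Series mapping column name to dtype
frame_shape = sample_df.shape        # (rows, columns) tuple
column_names = list(sample_df.columns)

# Method: must be called with parentheses to run.
first_row = sample_df.head(1)

print(column_types)
print(frame_shape)
```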

#### Display a statistical overview of the DataFrame

[`describe()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) is a method that returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns of your DataFrame.

In [5]:
data_file_pd.describe()

| | id | Amount |
|---|---|---|
| count | 1000.0 | 1000.0 |
| mean | 500.5 | 10051.3236 |
| std | 288.819436 | 5831.230806 |
| min | 1.0 | 3.4 |
| 25% | 250.75 | 4854.875 |
| 50% | 500.5 | 10318.05 |
| 75% | 750.25 | 15117.425 |
| max | 1000.0 | 19987.4 |
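
The output above covers only `id` and `Amount` because `describe()` on a DataFrame summarizes numeric columns by default. The sketch below, which assumes a hypothetical five-row sample built from the rows shown earlier, reads one statistic out of that summary and then calls `describe()` on a text column, where it reports count, unique, top, and freq instead.

```python
import pandas as pd

# Hypothetical sample based on the first five rows shown earlier.
sample_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Gender": ["M", "M", "F", "F", "F"],
    "Amount": [8067.7, 7330.1, 16335.0, 12460.8, 15271.9],
})

# Numeric summary: the result is itself a DataFrame indexed by statistic name.
numeric_summary = sample_df.describe()
mean_amount = numeric_summary.loc["mean", "Amount"]

# Describing a text Series yields count / unique / top / freq.
gender_summary = sample_df["Gender"].describe()

print(mean_amount)
print(gender_summary)
```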


#### Reference a single column within a DataFrame

In [6]:
data_file_pd["Amount"].head()

0     8067.7
1     7330.1
2    16335.0
3    12460.8
4    15271.9
Name: Amount, dtype: float64

#### Reference multiple columns within a DataFrame

Notice that the columns are provided as a list.

In [7]:
data_file_pd[["Amount", "Gender"]].head()

| | Amount | Gender |
|---|---|---|
| 0 | 8067.7 | M |
| 1 | 7330.1 | M |
| 2 | 16335.0 | F |
| 3 | 12460.8 | F |
| 4 | 15271.9 | F |
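
The difference between the two forms of indexing is the type that comes back: a single column name returns a Series, while a list of names returns a DataFrame, even if the list holds just one name. A short sketch, assuming a hypothetical `sample_df` built from the rows shown above:

```python
import pandas as pd

# Hypothetical sample based on the rows shown above.
sample_df = pd.DataFrame({
    "Gender": ["M", "M", "F", "F", "F"],
    "Amount": [8067.7, 7330.1, 16335.0, 12460.8, 15271.9],
})

single = sample_df["Amount"]            # string key -> Series
as_frame = sample_df[["Amount"]]        # one-item list -> DataFrame
pair = sample_df[["Amount", "Gender"]]  # list order sets the column order

print(type(single).__name__)    # Series
print(type(as_frame).__name__)  # DataFrame
```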


#### The `mean()` method averages the series

In [8]:
average = data_file_pd["Amount"].mean()
average

10051.323600000002

#### The `sum()` method adds every entry in the series

In [9]:
total = data_file_pd["Amount"].sum()
total

10051323.600000001
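
The two results are related: the mean is the sum divided by the number of non-missing entries, which is why 10051323.6 / 1000 gives the 10051.3236 shown above. The sketch below checks that relationship on a hypothetical five-value Series and also shows that, by default, missing values are skipped rather than counted.

```python
import pandas as pd

# Hypothetical sample: the first five Amount values shown earlier.
amounts = pd.Series([8067.7, 7330.1, 16335.0, 12460.8, 15271.9], name="Amount")

total = amounts.sum()
average = amounts.mean()
manual_average = total / amounts.count()  # count() ignores missing values

# With a missing entry, mean() skips it by default (skipna=True).
with_gap = pd.Series([10.0, None, 20.0])
gap_average = with_gap.mean()

print(total, average, manual_average, gap_average)
```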

#### The `unique()` method shows all distinct values in a Series

In [10]:
unique = data_file_pd["Last Name"].unique()
unique

array(['Lopez', 'White', 'Lewis', 'Burns', 'Romero', 'Andrews', 'Baker',
       'Diaz', 'Burke', 'Richards', 'Hansen', 'Tucker', 'Wheeler',
       'Turner', 'Reynolds', 'Carpenter', 'Scott', 'Ryan', 'Marshall',
       'Fernandez', 'Olson', 'Riley', 'Woods', 'Wells', 'Gutierrez',
       'Harvey', 'Ruiz', 'Lee', 'Welch', 'Cooper', 'Nichols', 'Murray',
       'Gomez', 'Green', 'Jacobs', 'Griffin', 'Perry', 'Dunn', 'Gardner',
       'Gray', 'Walker', 'Harris', 'Lawrence', 'Black', 'Simpson', 'Sims',
       'Weaver', 'Carr', 'Owens', 'Stephens', 'Butler', 'Matthews', 'Cox',
       'Brooks', 'Austin', 'Moore', 'Hunter', 'Cunningham', 'Lane',
       'Montgomery', 'Vasquez', 'Freeman', 'Hernandez', 'Alexander',
       'Pierce', 'Mcdonald', 'Kelly', 'Foster', 'Bell', 'Johnson',
       'Bowman', 'Porter', 'Wood', 'Reid', 'Willis', 'Bishop',
       'Washington', 'Gonzales', 'Davis', 'Martinez', 'Martin', 'Long',
       'Howell', 'Hawkins', 'Knight', 'Price', 'Day', 'Bailey', 'Flores',
       'You
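
When the list of distinct values is long, as it is here, the count of distinct values is often more useful than the values themselves. The sketch below uses a hypothetical short Series of last names to show `unique()` alongside `nunique()`, which returns just the number of distinct values.

```python
import pandas as pd

# Hypothetical sample with one repeated last name.
last_names = pd.Series(["Lopez", "White", "Lewis", "Lopez"], name="Last Name")

# unique() keeps values in order of first appearance.
distinct_names = list(last_names.unique())

# nunique() returns only how many distinct values there are.
distinct_count = last_names.nunique()

print(distinct_names)
print(distinct_count)
```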

#### The [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method counts the number of occurrences of each value in a column

In [11]:
count = data_file_pd["Gender"].value_counts()
count

M    515
F    485
Name: Gender, dtype: int64
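
Counts can also be expressed as proportions by passing `normalize=True`; for the output above that would correspond to 515 / 1000 = 0.515 for `M`. A minimal sketch on a hypothetical five-row Series:

```python
import pandas as pd

# Hypothetical sample based on the first five Gender values shown earlier.
genders = pd.Series(["M", "M", "F", "F", "F"], name="Gender")

counts = genders.value_counts()                # sorted, most frequent first
shares = genders.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```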

#### Calculations can also be performed on Series and added into DataFrames as new columns

In [12]:
thousands_of_dollars = data_file_pd["Amount"]/1000
data_file_pd["Thousands of Dollars"] = thousands_of_dollars

data_file_pd.head()

| | id | First Name | Last Name | Gender | Amount | Thousands of Dollars |
|---|---|---|---|---|---|---|
| 0 | 1 | Todd | Lopez | M | 8067.7 | 8.0677 |
| 1 | 2 | Joshua | White | M | 7330.1 | 7.3301 |
| 2 | 3 | Mary | Lewis | F | 16335.0 | 16.335 |
| 3 | 4 | Emily | Burns | F | 12460.8 | 12.4608 |
| 4 | 5 | Christina | Romero | F | 15271.9 | 15.2719 |
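
The division above is applied element by element, so no loop is needed. Assigning with `df[...] = ...` changes the DataFrame in place; as one possible alternative, `assign()` returns a new DataFrame and leaves the original untouched. The sketch below assumes a hypothetical `sample_df`, and the `**{...}` form is used only because the column name contains spaces.

```python
import pandas as pd

# Hypothetical sample based on the rows shown above.
sample_df = pd.DataFrame({
    "id": [1, 2, 3],
    "Amount": [8067.7, 7330.1, 16335.0],
})

# assign() returns a copy with the extra column; sample_df is not modified.
with_thousands = sample_df.assign(
    **{"Thousands of Dollars": sample_df["Amount"] / 1000}
)

print(with_thousands)
```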


In [14]:
data_file_pd.describe()

| | id | Amount | Thousands of Dollars |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 500.5 | 10051.3236 | 10.051324 |
| std | 288.819436 | 5831.230806 | 5.831231 |
| min | 1.0 | 3.4 | 0.0034 |
| 25% | 250.75 | 4854.875 | 4.854875 |
| 50% | 500.5 | 10318.05 | 10.31805 |
| 75% | 750.25 | 15117.425 | 15.117425 |
| max | 1000.0 | 19987.4 | 19.9874 |


In [15]:
# Reference a single column
data_file_pd["Amount"]


0       8067.7
1       7330.1
2      16335.0
3      12460.8
4      15271.9
        ...   
995    17868.5
996    15182.9
997     3720.7
998    10824.6
999     6090.7
Name: Amount, Length: 1000, dtype: float64

In [16]:
# Reference multiple columns
data_file_pd[["First Name","Amount"]]

| | First Name | Amount |
|---|---|---|
| 0 | Todd | 8067.7 |
| 1 | Joshua | 7330.1 |
| 2 | Mary | 16335.0 |
| 3 | Emily | 12460.8 |
| 4 | Christina | 15271.9 |
| ... | ... | ... |
| 995 | Paula | 17868.5 |
| 996 | Paula | 15182.9 |
| 997 | Thomas | 3720.7 |
| 998 | Jacqueline | 10824.6 |


In [17]:
# Average
data_file_pd["Amount"].mean()

10051.323600000002

In [18]:
# Sum
data_file_pd["Amount"].sum()

10051323.600000001

In [19]:
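# All distinct values
# Note: without parentheses, `unique` is only referenced, not called, so the
# output is the bound method itself; use `.unique()` to get the values.
data_file_pd["Amount"].unique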


<bound method Series.unique of 0       8067.7
1       7330.1
2      16335.0
3      12460.8
4      15271.9
        ...   
995    17868.5
996    15182.9
997     3720.7
998    10824.6
999     6090.7
Name: Amount, Length: 1000, dtype: float64>
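
The output above is a `bound method` rather than an array because the cell names the method without calling it. The sketch below contrasts the two on a hypothetical three-value Series: referencing `unique` yields a callable object, while `unique()` runs it and returns the distinct values.

```python
import pandas as pd

# Hypothetical sample with one repeated value.
amounts = pd.Series([8067.7, 7330.1, 8067.7], name="Amount")

method_reference = amounts.unique   # no parentheses: the method object itself
distinct_values = amounts.unique()  # parentheses: the method actually runs

print(callable(method_reference))  # True
print(list(distinct_values))       # [8067.7, 7330.1]
```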

In [20]:
# Count number of occurrences
data_file_pd["Gender"].value_counts()

M    515
F    485
Name: Gender, dtype: int64