### Working with Pandas Dataframes

The main data structure in pandas, and the one most people want to use, is the data frame. Data frames are two-dimensional tables that look and work much like an Excel spreadsheet. Rows are retrieved via the index: use .loc to select by label and .iloc to select by position.

But of course, data frames also have columns, each of which has a name. Each column is
effectively its own series, which means that it has an independent dtype from other columns.

In a typical data frame, each column represents **a feature, or attribute**, of our data, while each row represents one sample. So in a data frame describing company employees, there would be one row per employee, and there would be columns for first name, last name, ID number, e-mail address, and salary.
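As a small illustration of columns having independent dtypes, here is a sketch using a made-up employee table. The names and values are invented for the example.

```python
import pandas as pd

# Hypothetical employee data, invented purely for illustration
employees = pd.DataFrame({
    'first_name': ['Ada', 'Grace'],
    'last_name': ['Lovelace', 'Hopper'],
    'id_number': [101, 102],
    'salary': [5500.50, 6200.00],
})

# Each column is a Series with its own dtype
print(employees.dtypes)

# Selecting one column returns a Series that shares the frame's index
salaries = employees['salary']
print(type(salaries).__name__, salaries.dtype)
```

Each row is one sample (an employee) and each column is one attribute, which is the layout the product examples below follow as well.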

In [18]:
from pandas import DataFrame, concat

product_df = DataFrame([
               {'product_id':23, 'name':'computer', 'wholesale_price': 500,
                 'retail_price':1000, 'sales':100},
               {'product_id':96, 'name':'Python Workout', 'wholesale_price': 35,
                'retail_price':75, 'sales':1000},
               {'product_id':97, 'name':'Pandas Workout', 'wholesale_price': 35,
                'retail_price':75, 'sales':500},
               {'product_id':15, 'name':'banana', 'wholesale_price': 0.5,
                'retail_price':1, 'sales':200},
               {'product_id':87, 'name':'sandwich', 'wholesale_price': 3,
                'retail_price':5, 'sales':300},
               ])

product_df

Unnamed: 0,product_id,name,wholesale_price,retail_price,sales
0,23,computer,500.0,1000,100
1,96,Python Workout,35.0,75,1000
2,97,Pandas Workout,35.0,75,500
3,15,banana,0.5,1,200
4,87,sandwich,3.0,5,300


In [19]:
# Net income: units sold * (retail price - wholesale price), summed over all products

(product_df['sales'] * (product_df['retail_price'] - product_df['wholesale_price'])).sum()

110700.0

You can create a new data frame from any of the following:

- list of lists/series, in which each inner list represents one row, and the column names are taken positionally

- list of dicts, in which the dict keys indicate which columns are set to each row

- dict of lists/series, in which the dict keys determine the column names, and the values are then assigned vertically

- 2-dimensional NumPy array
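The four construction styles listed above can be sketched as follows. The tiny tables are invented for illustration, and the variable names are not from the original notebook.

```python
import numpy as np
import pandas as pd

columns = ['name', 'retail_price']

# 1. List of lists: each inner list is a row; names are assigned by position
from_rows = pd.DataFrame([['banana', 1], ['sandwich', 5]], columns=columns)

# 2. List of dicts: each dict is a row; keys select the columns
from_dicts = pd.DataFrame([{'name': 'banana', 'retail_price': 1},
                           {'name': 'sandwich', 'retail_price': 5}])

# 3. Dict of lists: keys are column names; values fill each column top to bottom
from_columns = pd.DataFrame({'name': ['banana', 'sandwich'],
                             'retail_price': [1, 5]})

# 4. Two-dimensional NumPy array: all values share a single dtype
from_array = pd.DataFrame(np.array([[1, 0.5], [5, 3.0]]),
                          columns=['retail_price', 'wholesale_price'])

print(from_rows)
print(from_array.dtypes)
```

The first three produce the same table here. The NumPy route suits purely numeric data, since an array holds a single dtype.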

df['retail_price'] - df['wholesale_price']

Here, we are retrieving the series df['retail_price'] and subtracting from it the series df['wholesale_price']. Because these two series are parallel to one another, with identical indexes, the subtraction will take place for each row, and will return a new series with the same index, but with the difference between them.

Once we have that series, we’ll multiply it by the number of sales we had for each product:

(df['retail_price'] - df['wholesale_price']) * df['sales']


This then results in a new series, one which shares an index with df, but whose values represent the net income for each product. We can add these values together with the sum method:

((df['retail_price'] - df['wholesale_price']) * df['sales']).sum()
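To make the "parallel series" point concrete, here is a minimal sketch showing that pandas matches values by index label rather than by position. The two small series are invented for the example.

```python
import pandas as pd

retail = pd.Series([1000, 75, 5], index=[0, 1, 4])
wholesale = pd.Series([3, 35, 500], index=[4, 1, 0])  # same labels, different order

# Values are matched by index label, so the row order doesn't matter
margin = retail - wholesale
print(margin)

# A label present in only one series yields NaN in the result
partial = retail - pd.Series([500], index=[0])
print(partial)
```

Because every column of a data frame shares the frame's index, arithmetic between two of its columns always lines up row by row.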

- On what products is our retail price more than twice the wholesale price?

- How much did the store make from food vs. computers vs. books? (You can just retrieve based on the index values, not anything more sophisticated.)

- Because your store is doing so well, you’re able to negotiate a 30% discount on the wholesale price of goods. Calculate the new net income.

In [20]:
# On which products is the retail price more than twice the wholesale price?

product_df[product_df['retail_price'] > 2 * product_df['wholesale_price']]

Unnamed: 0,product_id,name,wholesale_price,retail_price,sales
1,96,Python Workout,35.0,75,1000
2,97,Pandas Workout,35.0,75,500


In [21]:
# Gross revenue (retail price * units sold) from books vs. food, selected by index label
books_revenue = (product_df['retail_price'] * product_df['sales']).loc[[1,2]].sum()


food_rev = (product_df['retail_price'] * product_df['sales']).loc[[3,4]].sum()

print(f'Food Revenue --> {food_rev}\nBook Revenue --> {books_revenue}')

Food Revenue --> 1700
Book Revenue --> 112500


In [22]:
# Net income after a 30% discount on the wholesale price

(product_df['sales'] * (product_df['retail_price'] - (0.7 * product_df['wholesale_price']))).sum()

141750.0

### Tax Planning
The backstory for this exercise is as follows: Our local government is thinking about imposing a sales tax, and is considering rates of 15, 20, and 25 percent. Show how much less you would net with each of these tax amounts by adding columns to the data frame for current income, as well as for income under each of the proposed tax rates.


In [23]:

# Pre-tax net income per product, computed once and reused for each tax rate
net_income = (product_df['retail_price'] - product_df['wholesale_price']) * product_df['sales']

product_df['after_15'] = net_income * (1 - 0.15)

product_df['after_20'] = net_income * (1 - 0.20)

product_df['after_25'] = net_income * (1 - 0.25)


In [24]:
# Selecting rows with loc indexing
product_df.loc[[0, 3], ['sales', 'after_15', 'after_20', 'after_25']]

Unnamed: 0,sales,after_15,after_20,after_25
0,100,42500.0,40000.0,37500.0
3,200,85.0,80.0,75.0


In [25]:
# Multiple Column Selection
product_df[['sales', 'after_15', 'after_20', 'after_25']].sum()

sales        2100.0
after_15    94095.0
after_20    88560.0
after_25    83025.0
dtype: float64

In [26]:
product_df[product_df['after_15'] <= 40000]

Unnamed: 0,product_id,name,wholesale_price,retail_price,sales,after_15,after_20,after_25
1,96,Python Workout,35.0,75,1000,34000.0,32000.0,30000.0
2,97,Pandas Workout,35.0,75,500,17000.0,16000.0,15000.0
3,15,banana,0.5,1,200,85.0,80.0,75.0
4,87,sandwich,3.0,5,300,510.0,480.0,450.0


- An alternative tax plan would charge 25% tax, but only on those products on which we
would net more than 20,000. In such a case, how much would we make?

- Yet another alternative tax plan would charge 25% tax on products whose retail price is greater than 80, 10% tax on products whose retail price is between 30 and 80, and no tax on others. Implement and calculate the result of such a tax scheme.
  
- These long floating-point numbers are getting a bit hard to read. Set the float_format
option in pandas such that the floating-point numbers will be displayed with commas
every three digits before the decimal point, and only two digits after the decimal point.
Note that this is a bit tricky, in that it requires understanding Python callables and the str.format method.
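One possible approach to the float_format exercise is sketched below. The option expects a callable that takes a float and returns a string, and the bound method of a format string fits that description. The sample values are illustrative only.

```python
import pandas as pd

# float_format expects a callable that takes a float and returns a string;
# the bound method '{:,.2f}'.format is exactly such a callable
pd.options.display.float_format = '{:,.2f}'.format

s = pd.Series([28125000.0, 885937.5, 100.0])
print(s)

# Restore the default display behaviour
pd.reset_option('display.float_format')
```

This changes only how floats are displayed; the stored values keep their full precision.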

In [27]:
# An alternative tax plan would charge 25% tax, but only on those products on which we
# would net more than 20,000. In such a case, how much would we make?

# Net profit per product: units sold * (retail price - wholesale price)
product_df['net_profit'] = product_df['sales'] * (product_df['retail_price'] - product_df['wholesale_price'])

In [28]:
product_df['net_profit']

0    50000.0
1    40000.0
2    20000.0
3      100.0
4      600.0
Name: net_profit, dtype: float64

In [29]:
# Select the rows to modify: products netting more than 20,000 keep 75% after tax
taxed = product_df['net_profit'] > 20000

product_df.loc[taxed, 'net_profit'] = 0.75 * product_df.loc[taxed, 'net_profit']

In [30]:
product_df

Unnamed: 0,product_id,name,wholesale_price,retail_price,sales,after_15,after_20,after_25,net_profit
0,23,computer,500.0,1000,100,42500.0,40000.0,37500.0,37500.0
1,96,Python Workout,35.0,75,1000,34000.0,32000.0,30000.0,30000.0
2,97,Pandas Workout,35.0,75,500,17000.0,16000.0,15000.0,20000.0
3,15,banana,0.5,1,200,85.0,80.0,75.0,100.0
4,87,sandwich,3.0,5,300,510.0,480.0,450.0,600.0


To achieve the above, the following steps were taken:
 - Calculate the net profit for all products.
 - Build a boolean mask for the rows with 'net_profit' > 20000 and select them via the syntax --> product_df.loc[taxed, 'net_profit']
 - That selection can be assigned to, so set it to 75% of its current value --> 0.75 * product_df.loc[taxed, 'net_profit']

Yet another alternative tax plan would charge 25% tax on products whose retail price is greater than 80, 10% tax on products whose retail price is between 30 and 80, and no tax on others. Implement and calculate the result of such a tax scheme.

In [31]:
# Start again from the pre-tax net profit so that the two tax plans are not compounded
product_df['net_profit'] = product_df['sales'] * (product_df['retail_price'] - product_df['wholesale_price'])

# 25% tax where the retail price is greater than 80
high_priced = product_df['retail_price'] > 80

product_df.loc[high_priced, 'net_profit'] = 0.75 * product_df.loc[high_priced, 'net_profit']

In [32]:
# 10% tax where the retail price is between 30 and 80
mid_priced = (product_df['retail_price'] > 30) & (product_df['retail_price'] < 80)

product_df.loc[mid_priced, 'net_profit'] = 0.9 * product_df.loc[mid_priced, 'net_profit']

In [33]:
product_df[['net_profit']].style.format('{:,.2f}')

Unnamed: 0,net_profit
0,37500.0
1,36000.0
2,18000.0
3,100.0
4,600.0
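As an alternative sketch, the tiered plan can be computed without overwriting a column by choosing a tax rate per row with numpy.select. The frame below is a cut-down, self-contained copy of the product data, and treating "between 30 and 80" as inclusive at both ends is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# A cut-down copy of the product data, so the example is self-contained
df = pd.DataFrame({
    'name': ['computer', 'Python Workout', 'banana'],
    'wholesale_price': [500, 35, 0.5],
    'retail_price': [1000, 75, 1],
    'sales': [100, 1000, 200],
})

# Pre-tax net income per product
net = df['sales'] * (df['retail_price'] - df['wholesale_price'])

# Pick a tax rate per row; conditions are checked in order, first match wins.
# Series.between is inclusive at both ends by default (an assumption here).
tax_rate = np.select(
    [df['retail_price'] > 80, df['retail_price'].between(30, 80)],
    [0.25, 0.10],
    default=0.0,
)

df['net_after_tax'] = net * (1 - tax_rate)
print(df[['name', 'net_after_tax']])
```

Keeping the pre-tax figure in its own variable means several tax plans can be compared side by side without one plan's result feeding into the next.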


## Adding new products

In [34]:
new_products = DataFrame([{'product_id':24, 'name':'phone', 'wholesale_price': 200,
                        'retail_price':500},
                        {'product_id':16, 'name':'apple', 'wholesale_price': 0.5,
                        'retail_price':1},
                        {'product_id':17, 'name':'pear', 'wholesale_price': 0.6,
                        'retail_price':1.2}], index=range(5,8))

product_df = concat([product_df, new_products])

product_df.loc[5, 'sales'] = 100
product_df.loc[6, 'sales'] = 200
product_df.loc[7, 'sales'] = 75


# Net income across all products: units sold * (retail price - wholesale price)
(product_df['sales'] * (product_df['retail_price'] - product_df['wholesale_price'])).sum()

140845.0

Find the IDs and names of the products that have sold more than the average number of units

In [35]:
product_df.loc[product_df['sales'] > product_df['sales'].mean(), ['product_id', 'name']]

Unnamed: 0,product_id,name
1,96,Python Workout
2,97,Pandas Workout


In [36]:
product_df[product_df['sales'] > product_df['sales'].mean()][['product_id', 'name']]

Unnamed: 0,product_id,name
1,96,Python Workout
2,97,Pandas Workout


- Show the ID and name of those products whose net income is in the top 25% quantile.
- Show the ID and name of products that have lower than average sales numbers, and
whose wholesale price is greater than the average.
- Show the wholesale and retail prices of products with product IDs between 80 and 100,
and which sold fewer than 400 units.

Step 1: Find the 75% quantile of net profit, which is the cutoff for the top 25%.

Step 2: Select the products whose net profit falls above that cutoff.

In [37]:
# Show the ID and name of those products whose net income is in the top 25% quantile

product_df.loc[product_df['net_profit'] > product_df['net_profit'].quantile(0.75), ['product_id', 'name']]

Unnamed: 0,product_id,name
0,23,computer


In [38]:
# Show the ID and name of products that have lower than average sales numbers, and
# whose wholesale price is greater than the average

product_df.loc[(product_df['sales'] < product_df['sales'].mean()) & (product_df['wholesale_price'] > product_df['wholesale_price'].mean()), ['name', 'product_id']]



Unnamed: 0,name,product_id
0,computer,23
5,phone,24


In [39]:
# Show the wholesale and retail prices of products with product IDs between 80 and 100,
# and which sold fewer than 400 units.

product_df.loc[(product_df['product_id'] < 100) & (product_df['product_id'] > 80) & (product_df['sales'] < 400), ['wholesale_price', 'retail_price']]

Unnamed: 0,wholesale_price,retail_price
4,3.0,5.0


## Finding Outliers

Data analysis is all about trying to better understand the information that we have collected, and use that understanding to improve our business. We’ve already seen how the mean, standard deviation, and median can all help us to understand our data. Another useful perspective is to look at the unusual elements of our data.

That is, don’t look at the normal values; instead look at the outliers. 
For example:

- Which of our users had an unusually high number of unsuccessful login attempts?
- Which of our products were the most popular?
- At which days and times are our sales low?

The term "outliers" doesn’t have a precise, standard definition. Many people
define it using the "inter-quartile range," or "IQR" for short, which is the value
at the 75% point (aka quantile(0.75)) minus the value at the 25% point (aka
quantile(0.25)).


Outliers would then be values below the 25% point - 1.5 * IQR, or any values above the 75% + 1.5 * IQR. We’ll use that definition here, but you might find that a different definition—say, anything below the mean - two standard deviations, or above the mean + two standard deviations, might be a better fit for your data.
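Both definitions mentioned above can be sketched as small helper functions. The helper names iqr_bounds and std_bounds, their parameters, and the sample distances are hypothetical and chosen only for this illustration.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return (low, high) cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def std_bounds(s, n=2):
    """Return (low, high) cutoffs: mean - n*std and mean + n*std."""
    return s.mean() - n * s.std(), s.mean() + n * s.std()

# Made-up distances, with one obviously unusual value
distances = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 40.0])

low, high = iqr_bounds(distances)
outliers = distances[(distances < low) | (distances > high)]
print(low, high)
print(outliers)
```

Which definition fits better depends on the data: the IQR version is less affected by the extreme values it is trying to detect, since quartiles move less than the mean and standard deviation do.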

In [40]:
import pandas as pd

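# Import each single-column file as a series.
# read_csv no longer accepts squeeze=True in pandas 2.x, so squeeze the one-column frame instead.
trip_dist = pd.read_csv('data/taxi-distance.csv', header=None).squeeze('columns')

taxi_count = pd.read_csv('data/taxi-passenger-count.csv', header=None).squeeze('columns')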

taxi_trip_df = DataFrame({
    'Trip_Distance': trip_dist,
    'Taxi_Count': taxi_count
})

# Preview the combined data frame
print(taxi_trip_df)

      Trip_Distance  Taxi_Count
0              1.63           1
1              0.46           1
2              0.87           1
3              2.13           1
4              1.40           1
...             ...         ...
9994           2.70           1
9995           4.50           1
9996           5.59           1
9997           1.54           6
9998           5.80           1

[9999 rows x 2 columns]

In [41]:
# Trip Distance Outliers

low_qtl = taxi_trip_df['Trip_Distance'].quantile(0.25)

high_qtl = taxi_trip_df['Trip_Distance'].quantile(0.75)

# Inter-quartile range: the 75% point minus the 25% point
iqr = high_qtl - low_qtl

# Cutoffs: values below low_outlier or above high_outlier count as outliers
low_outlier = low_qtl - 1.5 * iqr

high_outlier = high_qtl + 1.5 * iqr

print(f' {low_outlier} {iqr} {high_outlier} ')

#taxi_trip_df.loc[(taxi_trip_df['Trip_Distance'] < (low_outlier))]

taxi_trip_df.loc[(taxi_trip_df['Trip_Distance'] > (high_outlier))]

 -2.4499999999999997 2.3 6.75 


Unnamed: 0,Trip_Distance,Taxi_Count
7,11.90,4
60,9.30,1
73,12.65,1
82,10.24,3
88,23.76,2
...,...,...
9975,7.60,1
9976,12.60,1
9979,11.30,1
9980,9.13,1


## Missing Data 

So far, we have seen that analyzing data with pandas isn’t too difficult. We
need to know what questions to ask, and we need to know which methods to
apply in a given situation—but it’s easy to imagine that a data analyst’s job
isn’t too rough.

The time has come, then, to give you some bad news: Most data is
incomplete. Perhaps the computer responsible for collecting data was down
last week. Or perhaps the sensors were off. Or perhaps we surveyed our users,
and a number of them decided not to answer.

Whatever the reason, it’s common for analysts to contend with missing
values. (Indeed, I’ve often heard analysts and data scientists say that 70-80
percent of their job involves cleaning, scaling, and otherwise manipulating
data so that they can use it.) While it would be nice to simply ignore those
missing values, that’s not always possible. If we were to remove any record
with any missing data, then we might find ourselves without any data at all,
which is a problem.

How do we represent missing values in pandas? It’s tempting to use 0, but
as you can imagine, that will quickly cause trouble when we try to calculate
mean values. Instead, then, pandas uses something known as NaN, aka "not a
number." NaN is the pandas style for writing nan, a value that’s also available
in NumPy. Both names are aliases to the same strange value, a float that
cannot be converted into an integer, and that is not equal to itself.

In NumPy, we typically search for NaN values with the isnan function.
pandas has a different approach, though: We can replace the NaN values in a
series (or data frame) with the fillna method. And we can drop any row with
NaN values with the dropna method.
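A minimal sketch of the behaviour described above, using a short invented series. It shows why equality tests cannot find NaN, and what fillna and dropna return.

```python
import numpy as np
import pandas as pd

# NaN is a float that is not equal to itself, so == cannot be used to find it
print(np.nan == np.nan)        # False
print(pd.isna(np.nan))         # True

s = pd.Series([10.0, np.nan, 30.0])

print(s.isna())                # boolean mask marking the missing entry
filled = s.fillna(0)           # replace NaN with a chosen value
dropped = s.dropna()           # drop the entries that are NaN
print(filled)
print(dropped)

# The original series is unchanged by fillna and dropna
print(s.isna().sum())          # 1
```

Note that dropna keeps the original index labels, so the result here has labels 0 and 2 with no label 1.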

Both of these methods return a new series or data frame, rather than
modifying the original object. However, the new object you get back might not
have copied the data, which means that assigning to it might produce the
famous, dreaded SettingWithCopyWarning. If you plan to modify the series
or data frame that you get back from df.dropna, you should probably invoke
the copy method, just to be sure:

**df = df.dropna().copy()**

## Interpolation

When your data contains missing values, you have a few possible ways to handle this. You can
remove rows with missing values, but that might remove a large number of otherwise useful
rows. A standard alternative is interpolation, in which you replace NaN values with values that are
likely to be close to the original ones. The values might be wrong, but they will be roughly in
the right ballpark.
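As a small illustration of the idea, here is a sketch on an invented series of temperature readings with gaps. It relies on the default behaviour of Series.interpolate, which is linear.

```python
import numpy as np
import pandas as pd

# Invented readings taken every three hours, with two consecutive gaps
temps = pd.Series([2.0, np.nan, np.nan, 8.0, 6.0, np.nan, 2.0],
                  index=[0, 3, 6, 9, 12, 15, 18])

# The default method is linear: gaps are filled with evenly spaced values
# between the nearest known readings on either side
filled = temps.interpolate()
print(filled)
```

The two missing readings between 2.0 and 8.0 become 4.0 and 6.0, and the single gap between 6.0 and 2.0 becomes their average, 4.0.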

In [4]:
# Load the Data set
temps_df = pd.read_csv("data/nyc-temps.txt").squeeze("columns")

temps_df.describe()





count    728.000000
mean      -1.050824
std        5.026357
min      -14.000000
25%       -4.000000
50%        0.000000
75%        2.000000
max       12.000000
Name: -1, dtype: float64

In [9]:
import numpy as np

import pandas as pd

# Build a data frame pairing each temperature with its hour of the day
# (8 readings per day, one every 3 hours, for 91 days)

fail_df = pd.DataFrame(
    {'temp': temps_df,
     'hour': [0, 3, 6, 9, 12, 15, 18, 21] * 91
     }
)


In [10]:
# Set the temperature readings taken at hours 3 and 6 to NaN
fail_df.loc[(fail_df['hour'] == 3) | (fail_df['hour'] == 6), 'temp'] = float("NaN")


In [12]:
# Describe the new Temperature Column after setting NaN
fail_df['temp'].describe()

count    546.000000
mean      -1.049451
std        5.027934
min      -14.000000
25%       -4.000000
50%        0.000000
75%        2.000000
max       12.000000
Name: temp, dtype: float64

In [13]:
# Now let's fill the NaN records with interpolated values
fail_df = fail_df.interpolate()

fail_df['temp'].describe()

count    728.000000
mean      -1.050824
std        5.026357
min      -14.000000
25%       -4.000000
50%        0.000000
75%        2.000000
max       12.000000
Name: temp, dtype: float64

interpolate supports a number of different interpolation methods. In the default mode (linear),
it treats the values as evenly spaced and fills each NaN with a value on the straight line between the
nearest non-NaN values before and after it; for a single missing value, that is simply their average. This is particularly appropriate for our missing-hour temperatures, since
temperature values don’t vary all that much from hour to hour, and can be assumed to go on a
continuum, either rising or falling along a curve. By contrast, if you were to take the temperature
of the oven in your kitchen, it

### Selective Updating

In [104]:
fail_df.loc[(fail_df['temp'] < 0), ['temp']] = 0

fail_df[['temp']]

Unnamed: 0,temp
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0
...,...
723,2.0
724,2.0
725,2.0
726,2.0
