# Pandas

Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame.
DataFrames are essentially multidimensional arrays with attached row and column labels.
Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to "data munging" tasks such as cleaning, filtering, and reshaping data.



## The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data (similar to a dictionary that maps index labels to values). It can be created from a list or an array.

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.Series([0.25, 0.50, 0.75, 1.00])

print(data.values) # the values of the series as a NumPy array
print(data.index) # the index labels of the series, same as print(data.keys())
print(list(data.items()))
print()

print(data[1:3]) # we can use slicing
print()

print(data[(data > 0.3) & (data < 0.8)]) # we can use masking


In [None]:
data = pd.Series([0.25, 0.50, 0.75, 1.00], index=["a", "b", "c", "d"], name="data") # we can supply our own index labels and a name
print(data["b":"c"]) # same result as data.iloc[1:3]; unlike positional slices, label slices include the end label "c"
print()

data = pd.Series({"a":1, "b":2}) # we can use a dictionary to create a series
print(data)
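
The dictionary analogy can be made concrete with a small sketch. The `prices` Series and its labels are invented for illustration; only the standard `in`, item assignment, and `keys()` behaviour of a Series is assumed.

```python
import pandas as pd

# Hypothetical example data, not taken from the notebook above
prices = pd.Series({"apple": 1.5, "pear": 2.0, "plum": 0.75})

has_pear = "pear" in prices      # membership tests check the index labels
prices["fig"] = 3.25             # assigning to a new label extends the Series
labels = list(prices.keys())     # keys() returns the index

# Unlike a plain dict, a Series also supports vectorized arithmetic
doubled = prices * 2
print(doubled)
```

The last line is the main practical difference from a dictionary: arithmetic applies to every value at once while the labels are kept.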

## The Pandas DataFrame Object

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

states = pd.DataFrame({'population': population,
                       'area': area})
print(states)
print(states.index) # the row labels of the DataFrame
print(states.columns) # the column labels of the DataFrame
print(states.area) # same as states["area"]
print(states["population"]) # same as states.population

states["density"] = states["population"] / states["area"] # we can add a new column by doing an operation

print(states)

# Even if some keys in the dictionary are missing, Pandas will fill them in with NaN
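
The NaN-filling behaviour mentioned in the last comment can be sketched as follows. The two records are made up for illustration; the sketch assumes only that a list of dictionaries is passed to the `pd.DataFrame` constructor.

```python
import pandas as pd

# Two hypothetical records with different keys
records = [{"a": 1, "b": 2}, {"b": 3, "c": 4}]
df_missing = pd.DataFrame(records)

# The columns are the union of all keys; absent entries become NaN
print(df_missing)
print(df_missing.isna().sum())  # number of missing values per column
```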

An example

In [None]:
# sample data used below to show different ways of accessing a DataFrame
points_table = {'Team':['MI', 'CSK', 'Devils', 'MI', 'CSK',
   'RCB', 'CSK', 'CSK', 'KKR', 'KKR', 'KKR', 'RCB'],
   'Rank' :[1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year' :[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Point':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(points_table)

# df.head(n) gives the first n rows of the data, default n=5
# df.tail(n) gives the last n rows of the data, default n=5
# df.sample(n) gives n random rows of the data, default n=1

print(df.columns, end="\n\n") # this will return all columns of the df as an Index object

# Slicing a DataFrame with [start:stop] selects rows by position
print(df[0:2], end="\n\n") # accessing the first two rows by slicing

# To choose certain columns and rows of a dataframe
teams = df[["Team", "Rank"]].iloc[[2, 4, 6]] # df[[...]] selects columns; with iloc and loc the first indexer selects rows
print(teams, end="\n\n")

# We can narrow down the df by using a boolean mask
best = df.loc[df['Rank'] == 3] # conditions can be combined with | (or) and & (and), each wrapped in parentheses
print(best, end="\n\n")

teams = df["Team"].tolist() # we can also manipulate the data by turning into a list
teams[0] = "RMA"
df["Team"] = teams
print(df.head(), end="\n\n")

To read a CSV file: dataframe = pd.read_csv("data.csv")

To count the missing values in each column: dataframe.isna().sum()

We use the drop(), dropna(), ffill()/bfill(), and fillna() methods to delete or fill specific rows and columns; drop() can also delete multiple columns at the same time.

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')\
labels: single label or list (index labels or column names to drop)\
axis: {0 or 'index', 1 or 'columns'}; its default value is 0\
columns: an alternative to passing labels with axis=1 (columns=labels)\
level: if the DataFrame has a MultiIndex, the level from which the labels are removed\
inplace: if False, return a copy; otherwise do the operation in place and return None
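
As a rough illustration of these parameters, the sketch below drops rows and columns from a small frame. The `scores` frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame for illustration
scores = pd.DataFrame({"name": ["Ann", "Bo", "Cy"],
                       "math": [90, 75, 60],
                       "art": [70, 85, 95]})

no_art = scores.drop(columns="art")           # drop a column; returns a new frame
no_first = scores.drop(index=0)               # drop the row labelled 0
same_as_columns = scores.drop("art", axis=1)  # labels with axis=1 is equivalent to columns=

# errors="ignore" suppresses the KeyError for labels that do not exist
unchanged = scores.drop(columns="history", errors="ignore")
print(no_art)
```

Because inplace defaults to False, `scores` itself still has all three columns after these calls.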

An example

In [None]:
df = pd.read_csv("data.csv")

newDF = df.head(20).copy() # take a copy so that changes do not affect df

M = newDF["Calories"].mean()
newDF["Calories"] = newDF["Calories"].fillna(M) # assign the result back; inplace=True on a column selection is unreliable in recent Pandas

# fillna() returns a new object by default (inplace=False)

newDF["Pulse"] = newDF["Pulse"].fillna(0)

newDF = newDF.dropna() # default: inplace=False, so it returns a new DataFrame

print(newDF)
print()
last = newDF.loc[newDF["Maxpulse"] > 135]
print(last)
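
Since the cell above depends on a local data.csv, here is a self-contained sketch of the same cleaning steps. The `workouts` frame is a made-up stand-in that reuses the column names from the example; its values are invented.

```python
import numpy as np
import pandas as pd

# Small stand-in for data.csv; the values are made up for illustration
workouts = pd.DataFrame({"Pulse": [110, np.nan, 103],
                         "Maxpulse": [130, 145, np.nan],
                         "Calories": [409.1, np.nan, 340.0]})

# Fill missing Calories with the column mean and missing Pulse with 0
workouts["Calories"] = workouts["Calories"].fillna(workouts["Calories"].mean())
workouts["Pulse"] = workouts["Pulse"].fillna(0)

cleaned = workouts.dropna()  # drops the row that still has a missing Maxpulse
print(cleaned)
```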

A Pandas DataFrame operates much like a structured array, and can be created directly from one:

In [None]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])

data = pd.DataFrame(A)
print(data)

## The Pandas Index Object
An Index is an immutable array of labels; duplicate labels are allowed.

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indC = indA.intersection(indB) # labels present in both; use union() for labels present in either
print(indC)
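
A short sketch of the other properties mentioned for the Index object: set-style operations, immutability, and duplicate labels. The `dup` index is a hypothetical addition.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

print(indA.union(indB))                 # all labels from either index
print(indA.symmetric_difference(indB))  # labels found in exactly one of the two

# An Index is immutable: item assignment raises a TypeError
try:
    indA[0] = 100
    mutated = True
except TypeError:
    mutated = False

# Duplicate labels are allowed
dup = pd.Index(["a", "b", "a"])
print(dup.is_unique)
```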

## Indexers: loc, iloc

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# The loc attribute allows indexing and slicing that always references the explicit index (here the labels 1, 3, 5); the end label is included
print(data.loc[1:3])
print()

# The iloc attribute allows indexing and slicing that always references the implicit Python-style index (positions starting from 0); the end position is excluded
print(data.iloc[1:3])
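
To make the difference visible side by side, this sketch repeats the same Series and compares what each indexer returns.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1:3]      # labels 1 through 3, end label included
by_position = data.iloc[1:3]  # positions 1 and 2, end position excluded

print(by_label.tolist())
print(by_position.tolist())
print(data.loc[1], data.iloc[1])  # label 1 versus position 1
```

With an integer index like this one, being explicit about loc or iloc avoids guessing which of the two a plain `data[1:3]` refers to.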

## Handling Missing Data
None is a Python singleton object that is often used for missing data in Python code.
Because it is a Python object, None cannot be used in an arbitrary NumPy/Pandas array, only in arrays with data type 'object'.

In [None]:
vals1 = np.array([1, None, 3, 4]) # we can create an array with a None object


# Summing an object array is much slower than summing a native int array (compare the two with %timeit in Jupyter)
for dtype in ['object', 'int']:
    np.arange(1E6, dtype=dtype).sum() # 1E6 = 1 * 10**6
# dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
# While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than operations on native types.

# The other missing-data representation is NaN (Not a Number), a special floating-point value

vals2 = np.array([1, np.nan, 3, 4]) # dtype is float64, so operations run as fast native code

# vals2.sum(), vals2.min(), vals2.max() all return nan
# to ignore nan values, use np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
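
The contrast between the two representations can be checked directly. This is a minimal sketch using the same arrays as above.

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])    # dtype=object
vals2 = np.array([1, np.nan, 3, 4])  # dtype=float64

# Arithmetic between an int and None is undefined, so the sum fails
try:
    vals1.sum()
    none_sum_failed = False
except TypeError:
    none_sum_failed = True

plain_sum = vals2.sum()      # nan propagates through ordinary aggregates
safe_sum = np.nansum(vals2)  # nan entries are skipped
safe_min = np.nanmin(vals2)
safe_max = np.nanmax(vals2)
print(none_sum_failed, plain_sum, safe_sum, safe_min, safe_max)
```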

In [None]:

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
C = A + B
print(C, end="\n\n")

C = A.add(B, fill_value=0) # we filled NaN values with 0
print(C)

# a number + NaN = NaN
# a number *  np.nan = nan

### Operating on Null Values
As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values.\
Python uses the keyword None to define null objects and variables.\
While None serves some of the same purposes as null in other languages, it is not defined to be 0 or any other value.\
In Python, None is an object. To check whether something is None, we should use the is / is not identity operators.

In [None]:
# isnull(): Generate a boolean mask indicating missing values
# notnull(): Opposite of isnull()
# dropna(): Return a filtered version of the data
# fillna(): Return a copy of the data with missing values filled or imputed

# Filling null values
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
print(data.fillna(0)) # returns a copy of series with filled nan values

# forward-fill: propagate the previous valid value forward (returns a copy); fillna(method='ffill') is deprecated in recent Pandas
print(data.ffill())
# back-fill: propagate the next valid value backward (returns a copy)
print(data.bfill())
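
The detection and removal methods listed at the top of the cell can be sketched on the same Series. Only the methods named above are used.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

mask = data.isnull()            # True where a value is missing (NaN or None)
present = data[data.notnull()]  # boolean masking keeps the valid entries
dropped = data.dropna()         # same rows as `present`

print(mask.tolist())
print(dropped.index.tolist())
print(data.ffill().tolist())    # each gap takes the previous valid value
```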

## Hierarchical Indexing: Pandas MultiIndex

In [None]:

# Suppose you would like to track data about states from two different years. Using plain Python tuples as keys is the inconvenient way to do it
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
print(pop)
print()

index = pd.MultiIndex.from_tuples(index) # the better way: build a MultiIndex from the tuples
pop = pop.reindex(index)
print(pop)
print()

# You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels.
# In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame.
pop_df = pop.unstack()

# Naturally, the stack() method provides the opposite operation
pop_df.stack()

# In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well.

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
print(health_data)

# Many of the MultiIndex slicing operations will fail if the index is not sorted.
# So we use: data = data.sort_index()
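
The point about sorting can be sketched with a deliberately unsorted index. The population figures are the ones used earlier in this section; the level names `state` and `year` are added here for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('Texas', 2010), ('California', 2000),
                                   ('Texas', 2000), ('California', 2010)],
                                  names=['state', 'year'])
pop = pd.Series([25145561, 33871648, 20851820, 37253956], index=index)

pop = pop.sort_index()                       # lexicographic sort of the index
california = pop.loc['California']           # all years for one state
year_2010 = pop.loc[:, 2010]                 # one year across all states
up_to_texas = pop.loc['California':'Texas']  # label slice on the outer level
print(year_2010)
```

Slices such as the last one are the kind of operation that relies on the index having been sorted first.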

## Combining Datasets
Pandas has a function, pd.concat(), whose syntax is similar to that of np.concatenate and which can be used for simple concatenations of Series or DataFrame objects.

In [None]:
# pd.concat() can be used for a simple concatenation of Series or DataFrame objects
# pd.concat(objs, axis=0, join='outer', ignore_index=False,
#           keys=None, levels=None, names=None, verify_integrity=False,
#           sort=False, copy=True)
# (the join_axes parameter of older versions was removed in Pandas 1.0)

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

# If you'd like to simply verify that the indices in the result of pd.concat() do not overlap, you can specify the verify_integrity flag.
# Sometimes the index itself does not matter, and you would prefer it to simply be ignored. This option can be specified using the ignore_index flag.

# Appending the data: older versions of Pandas offered ser1.append(ser2); it was deprecated in Pandas 1.4 and removed in 2.0, so use pd.concat([ser1, ser2]) instead.
# Keep in mind that unlike the append() and extend() methods of Python lists, concatenation in Pandas does not modify the original object; instead it creates a new object with the combined data.
# It is also not very efficient when repeated, because each call creates a new index and data buffer.
# Thus, if you plan to combine many pieces, it is generally better to build a list of DataFrames and pass them all at once to the concat() function.
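
The two flags and the build-a-list advice can be sketched together. Here `ser2` is given an overlapping label on purpose, and the `pieces` list is a hypothetical stand-in for frames produced in a loop.

```python
import pandas as pd

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[3, 4, 5])  # label 3 overlaps on purpose

# verify_integrity=True raises a ValueError when the result would contain duplicate labels
try:
    pd.concat([ser1, ser2], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

# ignore_index=True discards the old labels and numbers the result 0..n-1
renumbered = pd.concat([ser1, ser2], ignore_index=True)

# Collect the pieces in a list and concatenate once, rather than growing step by step
pieces = [pd.DataFrame({'x': [i, i + 1]}) for i in range(3)]
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```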

### Merge and Join

In [None]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(display('df1', 'df2'))
print()

# To combine this information into a single DataFrame, we can use the pd.merge() function.
df3 = pd.merge(df1, df2)
print(df3)
# The pd.merge() function recognizes that each DataFrame has an "employee" column, and automatically joins using this column as a key.
print("-----------------------------------------------")
# Most simply, you can explicitly specify the name of the key column using the on keyword
print(display('df1', 'df2', "pd.merge(df1, df2, on='employee')"))
print("-----------------------------------------------")

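# For convenience, DataFrames implement the join() method, which performs a merge that defaults to joining on indices
# Because df1 and df2 share the 'employee' column, set it as the index first:
# df1.set_index('employee').join(df2.set_index('employee'))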

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
print(display('df6', 'df7', 'pd.merge(df6, df7)'))
# Here we have merged two datasets that have only a single "name" entry in common: Mary.
# By default, the result contains the intersection of the two sets of inputs; this is what is known as an inner join.
# We can specify this explicitly using the how keyword, which defaults to "inner"

# Other options for the how keyword are 'outer', 'left', and 'right'.
# An outer join returns a join over the union of the input keys (here all names from both inputs), and fills in all missing values with NAs
print(display('df6', 'df7', "pd.merge(df6, df7, how='outer')"))

# The left join and right join return joins over the left entries and right entries, respectively.
print(display('df6', 'df7', "pd.merge(df6, df7, how='left')"))

# Finally, you may end up in a case where your two input DataFrames have conflicting column names.
# In that case, the merge function automatically appends a suffix _x or _y to make the output columns unique.
# If these defaults are inappropriate, it is possible to specify custom suffixes using the suffixes keyword

# Examples on https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html

# head(n) returns the first n rows of the data, default 5
# tail(n) returns the last n rows of the data, default 5
# sample(n) returns n randomly chosen rows of the data, default 1
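
The suffix behaviour described above can be sketched with two frames that share a non-key column. The frames `df8` and `df9` and the `rank` column are illustrative assumptions, not part of the cell above.

```python
import pandas as pd

# Hypothetical frames that share a non-key column name, "rank"
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})

default = pd.merge(df8, df9, on='name')                        # rank_x, rank_y
custom = pd.merge(df8, df9, on='name', suffixes=('_L', '_R'))  # rank_L, rank_R
print(custom)
```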

### Aggregation and Grouping

In [None]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
print(df)
print()
print(df.groupby('key')) # returns a DataFrameGroupBy object; no computation happens until an aggregate is applied
# To produce a result, we can apply an aggregate to this DataFrameGroupBy object
print(df.groupby('key').sum())
print("-----------------------------------------------------------------")

# GroupBy objects have aggregate(), filter(), transform(), and apply() methods that efficiently implement a variety of useful operations before combining the grouped data.

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])

print(df)
print("-----------------------------------------------------------------")

# Aggregation (combining each group into summary values)
# The aggregate() method can take a string, a function, a list thereof, or a dictionary mapping column names to operations, and compute all the aggregates at once.
df1 = df.groupby('key').aggregate({'data1': 'min',
                                   'data2': 'max'})
print(df1)
print("-----------------------------------------------------------------")

# Filtering
# A filtering operation allows you to drop data based on the group properties.
def filter_func(x):
    return x['data2'].std() > 4

print(display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)"))
# The filter function should return a Boolean value specifying whether the group passes the filtering.
# Here because group A does not have a standard deviation greater than 4, it is dropped from the result.
print("-----------------------------------------------------------------")

# Transformation
# While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine.
df2 = df.groupby('key').transform(lambda x: x - x.mean())
print(df2)
print("-----------------------------------------------------------------")

# The apply() method
# The apply() method lets you apply an arbitrary function to the group results.
# The function should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to the type of output returned.
def norm_by_data2(x):
    # x is a DataFrame of group values; work on a copy so the input is never mutated
    x = x.copy()
    x['data1'] = x['data1'] / x['data2'].sum()
    return x

print(display('df', "df.groupby('key').apply(norm_by_data2)"))
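
As a complement to the dictionary form of aggregate() used above, this sketch shows the list form and named aggregation. The output names `lowest` and `highest` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6)})

# A list of names computes several aggregates per group in one call
summary = df.groupby('key')['data1'].aggregate(['min', 'mean', 'max'])
print(summary)

# Named aggregation gives the output columns explicit names
named = df.groupby('key').agg(lowest=('data1', 'min'), highest=('data1', 'max'))
print(named)
```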

## Pivot Tables
A pivot table is an operation similar to GroupBy that is commonly seen in spreadsheets and other programs that operate on tabular data.

In [None]:
# We'll use the database of passengers on the Titanic, available through the Seaborn library
import seaborn as sns

In [None]:

titanic = sns.load_dataset('titanic')

# Pivot Table Syntax
print(titanic.pivot_table('survived', index='sex', columns='class'))
print("----------------------------------------")

# Multi-level pivot tables
age = pd.cut(titanic['age'], [0, 18, 80])
print(titanic.pivot_table('survived', ['sex', age], 'class'))

# DataFrame.pivot_table(values=None, index=None, columns=None,
#                      aggfunc='mean', fill_value=None, margins=False,
#                      dropna=True, margins_name='All')
# margins=True adds an 'All' row and column that apply aggfunc (the mean by default) across each row and column
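
To see what margins does without downloading the Titanic data, here is a sketch on a tiny invented table. The `trips` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Tiny made-up table standing in for the Titanic data
trips = pd.DataFrame({'sex': ['f', 'f', 'm', 'm', 'm'],
                      'class': ['1st', '2nd', '1st', '2nd', '2nd'],
                      'survived': [1, 1, 0, 1, 0]})

table = trips.pivot_table('survived', index='sex', columns='class',
                          aggfunc='mean', margins=True)
print(table)
```

The 'All' entries are means over the underlying rows, so the bottom-right cell is the overall survival rate of the five rows (3 out of 5).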

## Data Loading into a Pandas DataFrame

In [None]:
df = pd.read_csv('data.csv', sep=",") # the sep parameter sets the separator used to split the fields
# skiprows=[...] skips the listed row numbers while reading
print(df.head()) # in Jupyter, we can view the raw file with: !more data.csv

# df.to_csv('out.csv') creates a new file and writes the data to it.
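
The sep and skiprows options can be tried without a file on disk. In this sketch the text in `raw` is invented, and `io.StringIO` stands in for an open file.

```python
import io
import pandas as pd

# In-memory text standing in for a file; the contents are invented
raw = "exported by some tool\nname;score\nAnn;90\nBo;75\n"

# sep sets the delimiter; skiprows=[0] skips the first (non-data) line
df_csv = pd.read_csv(io.StringIO(raw), sep=";", skiprows=[0])
print(df_csv)

# to_csv without a path returns the CSV text; index=False omits the row labels
text = df_csv.to_csv(index=False)
print(text)
```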

### csv library

In [None]:
import csv

In [None]:
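f = open('data.csv')

reader = csv.reader(f) # csv.reader wraps the file and yields each row as a list of strings
f.close() # close the handle when done; the with block below does this automatically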

with open('data.csv') as f:
    lines = list(csv.reader(f))
print(lines[0:5])

# JSON: https://www.w3schools.com/js/js_json_intro.asp more popular
# XML: https://www.w3schools.com/xml/xml_whatis.asp
# HTML: https://www.w3schools.com/html/html_intro.asp

### JSON

In [None]:
import json

In [None]:
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""
result = json.loads(obj)
print(result)

# tables = pd.read_html('examples/fdic_failed_bank_list.html')
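
Once JSON text has been parsed, nested lists of records can be handed to Pandas. This sketch uses a shortened version of the object above; keeping only the `name` and `age` keys is an illustrative choice.

```python
import json
import pandas as pd

obj = """
{"name": "Wes",
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30},
              {"name": "Katie", "age": 38}]}
"""
result = json.loads(obj)      # JSON text -> Python dict; null becomes None
as_text = json.dumps(result)  # Python object -> JSON text

# A list of dicts can be passed to the DataFrame constructor;
# the columns argument selects which keys to keep
siblings = pd.DataFrame(result["siblings"], columns=["name", "age"])
print(siblings)
```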

### Binary Data Formats
One of the easiest ways to store data efficiently in binary format (serialization) is to use Python's built-in pickle library.

In [None]:
frame = pd.read_csv('data.csv') # by using pandas, we can turn the data into a pickle
frame.to_pickle('examples/frame_pickle')
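
A round trip through pickle can be sketched without data.csv. The `frame` below is a small invented DataFrame, and a temporary directory is used so that no files are left behind.

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")
    frame.to_pickle(path)            # serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back into memory

print(restored.equals(frame))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.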

### Reading Microsoft Excel Files
To read data from an Excel file:\
xlsx = pd.ExcelFile('examples/ex1.xlsx')\
pd.read_excel(xlsx, sheet_name='Sheet1')\
\
To create an Excel file and write data to it:\
writer = pd.ExcelWriter('examples/ex2.xlsx')\
frame.to_excel(writer, sheet_name='Sheet1')\
writer.close()

## Interacting with Web APIs

In [None]:
import requests

In [None]:
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
print(resp)

data = resp.json()
print(data[0:5])

issues = pd.DataFrame(data, columns=['number', 'title',
                                     'labels', 'state'])
print(issues.head())

# https://pandas.pydata.org/docs/user_guide/io.html