# Pandas

* **Data manipulation**: selecting subsets of data, merging and joining datasets, transforming data, and handling missing values
* **Data cleaning**: tools for cleaning messy datasets by handling missing data, removing duplicates, and converting data types
* **Data analysis**: statistical and mathematical functions for data analysis, including descriptive statistics, aggregation, grouping, and time series analysis
* **Data visualization**: Pandas has no plotting engine of its own, but `Series.plot()` and `DataFrame.plot()` wrap Matplotlib, and Pandas objects can be passed directly to visualization libraries such as Matplotlib and Seaborn. Matplotlib must be installed, but for basic plots you do not need to import it yourself.
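
As a rough illustration of the first three points, the sketch below cleans, subsets, and aggregates a tiny table. The `sales` data and its column names are hypothetical and were invented only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicated row and one missing value.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "units": [10, 10, 4, np.nan, 6],
})

# Data cleaning: drop exact duplicate rows, then replace the missing value with 0.
clean = sales.drop_duplicates().fillna({"units": 0})

# Data manipulation: select the subset of rows with more than 5 units.
big = clean[clean["units"] > 5]

# Data analysis: total units per store.
totals = clean.groupby("store")["units"].sum()
print(totals)
```

Filling missing values with 0 is only one possible choice; dropping those rows with `dropna()` would be another.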


In [1]:
import pandas as pd
import numpy as np

## Series

* A one-dimensional, array-like object that can hold data of any type (integers, floats, strings, etc.). It is similar to a NumPy array, but every element also carries an index label.
* Similar to an array, a list, or a column in a table
* Each item in the Series gets an index label. By default, the labels are the integers 0 to N - 1, where N is the length of the Series.
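
A minimal sketch of the difference between index labels and integer positions, using made-up values. `.loc` looks up by label and `.iloc` by position.

```python
import pandas as pd

# Hypothetical values; no index is given, so the labels default to 0..N-1.
s = pd.Series([10, 20, 30])
print(list(s.index))

# The same values with explicit string labels.
t = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = t.loc["b"]      # look up by index label
by_position = t.iloc[1]    # look up by integer position
print(by_label, by_position)
```

Both lookups return the same element here, because label `"b"` sits at position 1.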

In [4]:
# create a series
sir = pd.Series([6, 'shameless', 3478, 'Jeremy Allen White', 839])
print(sir)

# create series with specified index
sir = pd.Series([6, 'shameless', 3478, 'Jeremy Allen White', 839], 
                index = ['G', 'O', 'A','L','S'])
print(sir)

0                     6
1             shameless
2                  3478
3    Jeremy Allen White
4                   839
dtype: object
G                     6
O             shameless
A                  3478
L    Jeremy Allen White
S                   839
dtype: object


In [19]:
# dictionary to series
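# named city_data rather than dict, so the built-in dict type is not shadowed
city_data = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100, 'Austin': 450, 'Boston': None}

# Series uses dictionary keys as index; None is stored as NaN, so the dtype becomes float64
city = pd.Series(city_data)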
print(city)

# use index to select specific element
print(city['Chicago'])

# use index to select specific elements -> double square brackets needed!
print(city[['Chicago', 'Austin']])

# boolean indexing
less = city<1000
print(less)

print(city[less])
print(city[city<1000])

# edit series elements based on index
city['Portland'] = 55555555
print(city['Portland'])

# edit series elements using boolean indexing
city[city > 1000] = 2
print(city)

# boolean test if a specific element is in the Series
print('Los Angeles' in city)
print('Chicago' in city)

# mathematical operations using scalars and functions
print(city/2)
np.square(city)  # not displayed: the result is neither printed nor the last expression in the cell

# adding Series aligns them on index labels; a label present in only one operand yields NaN
print(city[['Chicago', 'Portland']] + city[['Austin', 'Portland']])




Chicago          1000.0
New York         1300.0
Portland          900.0
San Francisco    1100.0
Austin            450.0
Boston              NaN
dtype: float64
1000.0
Chicago    1000.0
Austin      450.0
dtype: float64
Chicago          False
New York         False
Portland          True
San Francisco    False
Austin            True
Boston           False
dtype: bool
Portland    900.0
Austin      450.0
dtype: float64
Portland    900.0
Austin      450.0
dtype: float64
55555555.0
Chicago          1000.0
New York            2.0
Portland            2.0
San Francisco       2.0
Austin            450.0
Boston              NaN
dtype: float64
False
True
Chicago          500.0
New York           1.0
Portland           1.0
San Francisco      1.0
Austin           225.0
Boston             NaN
dtype: float64
Austin      NaN
Chicago     NaN
Portland    4.0
dtype: float64
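
The last result above contains NaN for `Austin` and `Chicago` because each label appears in only one of the two operands. As a sketch of one way to avoid that, `Series.add` accepts a `fill_value` that stands in for a missing label. The two small Series below are hypothetical stand-ins built to mirror that sum.

```python
import pandas as pd

# Hypothetical values, chosen only to show label alignment.
left = pd.Series({"Chicago": 1000.0, "Portland": 2.0})
right = pd.Series({"Austin": 450.0, "Portland": 2.0})

# The + operator aligns on index labels; labels found in only one Series give NaN.
plain_sum = left + right

# Series.add with fill_value=0 treats a label missing from one side as 0 instead.
filled_sum = left.add(right, fill_value=0)
print(filled_sum)
```

Whether 0 is a sensible substitute depends on the data; for a running total it usually is, for an average it usually is not.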


## DataFrame

* Two-dimensional labeled data structure with columns of potentially different types
* Similar to a spreadsheet or SQL table; it is the primary object for data manipulation and analysis in Pandas
* Tabular data structure
* A group of Series objects (the columns) that share one row index; each Series is identified by its column name
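
The last point can be checked directly. In this small sketch (a hypothetical two-row frame), selecting a column returns a Series whose index is the frame's row index.

```python
import pandas as pd

# Hypothetical two-column frame.
df = pd.DataFrame({"team": ["Bears", "Lions"], "wins": [11, 6]})

wins = df["wins"]                    # selecting one column returns a Series
print(type(wins).__name__)
print(wins.index.equals(df.index))   # the column carries the frame's row index
print(list(df.columns))              # column names identify each Series
print(df.dtypes)                     # each column keeps its own dtype
```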

In [21]:
# dictionary to data frame

dicttoframe = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012], 
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}

football = pd.DataFrame(dicttoframe, columns=['year', 'team', 'wins', 'losses'])
print(football)


   year     team  wins  losses
0  2010    Bears    11       5
1  2011    Bears     8       8
2  2012    Bears    10       6
3  2011  Packers    15       1
4  2012  Packers    11       5
5  2010    Lions     6      10
6  2011    Lions    10       6
7  2012    Lions     4      12
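
Boolean indexing works on a DataFrame much as it does on a Series. The sketch below rebuilds the same football table so that it runs on its own, then filters rows and aggregates by team. The names `strong` and `wins_by_team` are introduced here for illustration.

```python
import pandas as pd

football = pd.DataFrame({
    "year": [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    "team": ["Bears", "Bears", "Bears", "Packers", "Packers", "Lions", "Lions", "Lions"],
    "wins": [11, 8, 10, 15, 11, 6, 10, 4],
    "losses": [5, 8, 6, 1, 5, 10, 6, 12],
})

# Boolean indexing on rows: seasons with at least 10 wins.
strong = football[football["wins"] >= 10]
print(strong)

# Grouping: total wins per team across the listed seasons.
wins_by_team = football.groupby("team")["wins"].sum()
print(wins_by_team)
```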


In [23]:
# csv to dataframe: first inspect the raw file
csvtoframe = '06.2 df.csv'

# open the file in read mode; the with block closes it afterwards
with open(csvtoframe, 'r') as readmode:
    # read entire contents of the file into a string variable
    contentstring = readmode.read()

# print the first 100 characters of the file contents
print(contentstring[:100])




a,b,c,d
0.33627233637218457,0.3250110687231613,0.0010196408377848298,0.40140189720154196
0.980264968
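
The raw text shows a header row `a,b,c,d` followed by comma-separated floats, which is the layout `pd.read_csv` expects by default. As a sketch, the block below parses an in-memory copy of the header and the first complete data row shown above. The `io.StringIO` object is a stand-in for the file, used so the example runs without it; with the real file, passing the path to `pd.read_csv` should behave the same way.

```python
import io

import pandas as pd

# Stand-in for the file: the header and first data row from the output above.
csv_text = (
    "a,b,c,d\n"
    "0.33627233637218457,0.3250110687231613,"
    "0.0010196408377848298,0.40140189720154196\n"
)

# read_csv accepts a path or a file-like object; the first row becomes the column names.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df.dtypes)
```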
