# pandas basics

This notebook goes through the basics of the pandas package.

In [None]:
import numpy as np
import pandas as pd

## Series

A Series is a one-dimensional, array-like object that holds a sequence of values together with an index of labels.

In [None]:
obj = pd.Series([4, 7, -5, 3])
obj

In [None]:
pd.Series(np.array([4, 7, -5, 3]))

In [None]:
obj.array

### Indexes

In [None]:
obj.index   # Getting the index of a Series

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2

In [None]:
obj2.index

In [None]:
obj2["a"]

In [None]:
obj2["d"] = 6
obj2

In [None]:
obj2[["c", "a", "d"]]

In [None]:
obj2 > 0

In [None]:
obj2[obj2 > 0]

In [None]:
obj2 * 2

In [None]:
np.exp(obj2)
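
The operations above all keep each value attached to its label. A minimal sketch of that behaviour, using the same values as obj2 (the names s, positive and doubled are only illustrative):

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

positive = s[s > 0]   # boolean mask keeps only the entries where the condition holds
doubled = s * 2       # arithmetic is applied element by element

# The labels travel with the values through both operations.
print(list(positive.index))
print(doubled["a"])
```

Filtering drops the label "a" together with its negative value, while the arithmetic result can still be looked up by label.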

A Python dictionary can be converted to a Series; the keys become the index:

In [None]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

In [None]:
obj3.to_dict()
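
When building a Series from a dictionary, you can also pass an explicit index to choose which keys to keep and in what order. The sketch below assumes a made-up list of labels called states; a label with no matching key gets a missing value, and a key that is not listed is dropped.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]   # hypothetical label order

s = pd.Series(sdata, index=states)

print(s)
print(s.isna())   # True only for the label that had no entry in sdata
```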

## DataFrame

A DataFrame is a rectangular table of data (like a spreadsheet). It can also be viewed as an ordered, named collection of columns that share the same index, where each column can have a different type (numeric, string, Boolean, etc.) and every value in a column is of that column's type. Like a spreadsheet, a DataFrame has both a row index and a column index.

In [None]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame
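
To see the "one type per column" idea in practice, the following sketch builds a smaller table of the same shape and inspects its column types and its two indexes. The name small is only illustrative.

```python
import pandas as pd

data = {"state": ["Ohio", "Nevada"],
        "year": [2000, 2001],
        "pop": [1.5, 2.4]}
small = pd.DataFrame(data)

print(small.dtypes)    # one type per column
print(small.index)     # row index
print(small.columns)   # column index
```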

In [None]:
frame.head()

In [None]:
frame.tail()

In [None]:
frame.head(3)

### Indexes of DataFrames

As with a Series, the row index is available through the index attribute:

In [None]:
frame.index 

### Selecting columns

A column of a DataFrame can be selected and turned into a Series:

In [None]:
frame["year"]

In [None]:
frame.year

In [None]:
frame["year"] = np.arange(6) + 1990   # using this selection to assign new values
frame

Assigning a DataFrame to a new name does not copy it; both names refer to the same object:

In [None]:
frame2 = frame

In [None]:
frame2

In [None]:
frame2["year"] = np.arange(6) + 2000
frame2

In [None]:
frame

To get an independent copy, use the copy method:

In [None]:
frame2 = frame.copy()

In [None]:
frame2["year"] = np.arange(6) + 1990
frame2

In [None]:
frame
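
The difference between a second name and a real copy can be checked directly. This is a small sketch with made-up names (original, alias, independent), not part of the running example:

```python
import pandas as pd

original = pd.DataFrame({"year": [2000, 2001, 2002]})

alias = original                # a second name for the same object
independent = original.copy()   # a new object with its own data

alias["year"] = [1990, 1991, 1992]         # also visible through original
independent["year"] = [2010, 2011, 2012]   # does not touch original

print(alias is original)
print(original["year"].tolist())
```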

In [None]:
frame2["Country"] = "USA"
frame2

### Selecting (or slicing) rows



In [None]:
frame2

In [None]:
frame2.loc[2]   # Using the index

In [None]:
frame2.iloc[2]   # Using the position

In [None]:
frame2.iloc[2:5]   

Selecting both rows and columns

In [None]:
frame2.loc[2:5, ["pop", "year"]]   # loc slices by label and includes the end label; column names do not work with iloc

In [None]:
frame2.iloc[2:5, 1:3] 

In [None]:
frame2.loc[:, ["pop", "year"]]   # Selecting all rows
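
Two details of loc and iloc are easy to miss: a label slice with loc includes its end point while a position slice with iloc does not, and the two only agree as long as the labels happen to match the positions. A minimal sketch, using a made-up single-column table:

```python
import pandas as pd

df = pd.DataFrame({"pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]})

by_label = df.loc[2:4]       # label slice: rows labelled 2, 3 and 4
by_position = df.iloc[2:4]   # position slice: the third and fourth row only

# After sorting, labels and positions no longer line up.
shuffled = df.sort_values("pop", ascending=False)
value_at_label_2 = shuffled.loc[2, "pop"]    # the row whose label is 2
value_at_position_2 = shuffled.iloc[2, 0]    # the third row in the new order

print(len(by_label), len(by_position))
print(value_at_label_2, value_at_position_2)
```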

### Selecting and replacing individual values

In [None]:
frame2

In [None]:
frame2.iloc[1,2]

In [None]:
frame2.iloc[1,2] = 33.3
frame2

### Dropping rows or columns

In [None]:
frame2.drop(index=[2, 4])   # Returns a new DataFrame; frame2 itself is unchanged

In [None]:
frame2

In [None]:
frame2.drop(columns=["pop", "year"])

In [None]:
frame2.drop(["pop", "year"], axis="columns")
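
Because drop returns a new DataFrame, the result has to be stored if it should be kept. The sketch below uses a made-up table to show both the unchanged original and the usual reassignment pattern:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

trimmed = df.drop(index=[1])   # new DataFrame without the row labelled 1
shape_after_drop = df.shape    # the original still has all rows and columns

df = df.drop(columns=["b"])    # reassign to keep the result

print(trimmed)
print(shape_after_drop)
print(df)
```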

### Function Application and Mapping

In [None]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
frame

In [None]:
np.abs(frame)   # Element-wise application of a function

In [None]:
np.max(frame)

Functions can be applied along columns or rows:

In [None]:
frame.apply(np.max, axis="columns")

In [None]:
frame.apply(np.max, axis="rows")
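
apply is not limited to NumPy functions; any function that takes a Series and returns a value can be used. As a sketch, the hypothetical helper value_range below computes the spread of each column or row of a small made-up table:

```python
import pandas as pd

df = pd.DataFrame([[1.0, 5.0, 2.0], [4.0, 0.0, 3.0]],
                  columns=list("bde"), index=["Utah", "Ohio"])

def value_range(x):
    """Difference between the largest and smallest value of a Series."""
    return x.max() - x.min()

per_column = df.apply(value_range)                  # one result per column (default)
per_row = df.apply(value_range, axis="columns")     # one result per row

print(per_column)
print(per_row)
```

With axis="columns" the function receives one row at a time, so the result is indexed by the row labels.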

max and other simple statistical functions are DataFrame methods, so they can be called directly without apply:

In [None]:
frame.max()

In [None]:
frame.max(axis="columns")

### Sorting DataFrames based on columns

In [None]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
frame

In [None]:
frame.sort_values("b")

In [None]:
frame.sort_values(["a", "b"])

In [None]:
frame.sort_values(["a", "b"],  ascending=False)
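
The ascending argument also accepts one value per sort column, which allows mixed directions. A short sketch on the same data (the name mixed is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by "a" ascending and, within equal "a", by "b" descending.
mixed = df.sort_values(["a", "b"], ascending=[True, False])

print(mixed)
print(df)   # sort_values returns a new DataFrame; df keeps its original order
```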

### Calculating descriptive statistics of DataFrames

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df

In [None]:
df.mean(axis="columns")

In [None]:
df.mean(axis="rows")

Missing values are skipped by default. Sometimes you instead want a missing value to carry over into the result:

In [None]:
df.mean(axis="columns", skipna=False)
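
The effect of skipna is easiest to see by comparing both variants side by side. This sketch rebuilds the same table under the assumed name table so that it can be run on its own:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                      [np.nan, np.nan], [0.75, -1.3]],
                     index=["a", "b", "c", "d"],
                     columns=["one", "two"])

skipping = table.mean(axis="columns")                # NaN is ignored where possible
strict = table.mean(axis="columns", skipna=False)    # any NaN makes the result NaN

print(pd.DataFrame({"skipna=True": skipping, "skipna=False": strict}))
```

Row "a" has one valid value, so it gets a mean by default but NaN with skipna=False; row "c" has no valid values and is NaN in both cases.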

In [None]:
df.idxmax(axis="rows")   # For each column, returns the row label at which the maximum occurs

If a column contains non-numerical values, we might be interested in counting how often each value occurs:

In [None]:
df2 = pd.DataFrame({"sample1" : ["c", "a", "d", "a", "a", "b", "b", "c", "c"], 
                    "sample2" : ["a", "a", "d", "a", "d", "d", "b", "a", "d"]})
df2

In [None]:
df2["sample1"].value_counts()

In [None]:
df2.value_counts()
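
value_counts can also report relative frequencies instead of absolute counts. A minimal sketch on the values of the sample1 column (the names counts and shares are illustrative):

```python
import pandas as pd

s = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

counts = s.value_counts()                 # absolute counts, most frequent first
shares = s.value_counts(normalize=True)   # relative frequencies that sum to 1

print(counts)
print(shares)
```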

There are other descriptive statistical functions, but describe gives us a descriptive overview of a whole DataFrame, which is often highly valuable:

In [None]:
df.describe()

describe cannot be applied along rows, but a column-wise summary is almost always what we are interested in.

We can also calculate the correlation between two columns:

In [None]:
df["one"].corr(df["two"])
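
To get a feeling for the values corr returns, it helps to try it on columns with a known relationship. The sketch below uses a made-up table with one perfectly increasing and one perfectly decreasing pair, and also shows that calling corr on the DataFrame gives all pairwise correlations at once:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.0, 4.0, 6.0, 8.0],
                   "z": [4.0, 3.0, 2.0, 1.0]})

r_xy = df["x"].corr(df["y"])   # y grows linearly with x
r_xz = df["x"].corr(df["z"])   # z falls linearly with x
matrix = df.corr()             # all pairwise correlations as a DataFrame

print(r_xy, r_xz)
print(matrix)
```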

## Reading data into Python and pandas

Reading in a standard, comma-separated CSV file:

In [None]:
df = pd.read_csv("ex1.csv")
df

Reading in CSV files with semicolon separators:

In [None]:
dfsc = pd.read_csv("ex1semicolon.csv")
dfsc

In [None]:
dfsc = pd.read_csv("ex1semicolon.csv", sep = ";")
dfsc
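
The effect of the sep argument can be reproduced without any file on disk by reading from an in-memory string. The content of text below is made up for illustration and is not the content of ex1semicolon.csv:

```python
import io
import pandas as pd

text = "a;b;c\n1;2;3\n4;5;6\n"   # made-up semicolon-separated content

wrong = pd.read_csv(io.StringIO(text))            # default sep="," finds a single column
right = pd.read_csv(io.StringIO(text), sep=";")   # three columns, as intended

print(wrong.shape, right.shape)
print(right)
```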

Reading in a file without a header:

In [None]:
df2 = pd.read_csv("ex1.csv", header = None)  # The first line is turned into data - you should know in advance if your file has a header!
df2

In [None]:
df2 = pd.read_csv("ex2.csv", header = None)
df2

Supplying column names yourself when the file has no header:

In [None]:
df2 = pd.read_csv("ex2.csv", names=["a", "b", "c", "d", "message"])   # header=None is implied when names is given
df2

Using a column as the index:

In [None]:
df2 = pd.read_csv("ex2.csv", names=["a", "b", "c", "d", "message"], index_col="message")
df2
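
The header, names and index_col options can be compared on a small in-memory example. The text below is made up and only stands in for a headerless file such as ex2.csv:

```python
import io
import pandas as pd

text = "1,2,3,hello\n4,5,6,world\n"   # made-up content without a header line
cols = ["a", "b", "c", "message"]

default = pd.read_csv(io.StringIO(text))                  # first data row is used up as the header
no_header = pd.read_csv(io.StringIO(text), header=None)   # columns are numbered 0, 1, 2, ...
named = pd.read_csv(io.StringIO(text), names=cols)        # our own column names
indexed = pd.read_csv(io.StringIO(text), names=cols, index_col="message")

print(default.shape, no_header.shape)
print(indexed)
```

Reading with the default settings silently loses one row of data, which is why it matters to know whether a file has a header.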

Skipping rows when reading in:

In [None]:
pd.read_csv("ex4.csv")

In [None]:
pd.read_csv("ex4.csv", skiprows=[0, 2, 3])
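
skiprows takes zero-based line numbers counted from the top of the file. The following sketch uses made-up content with two comment lines (it is not the content of ex4.csv) to show which lines are dropped:

```python
import io
import pandas as pd

text = ("# a comment line\n"      # line 0: skipped
        "a,b,c\n"                 # line 1: becomes the header
        "# another comment\n"     # line 2: skipped
        "1,2,3\n"
        "4,5,6\n")

clean = pd.read_csv(io.StringIO(text), skiprows=[0, 2])

print(clean)
```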

Writing to csv files:

In [None]:
df2

In [None]:
df2.to_csv("out.csv")

In [None]:
df2.to_csv("out.csv", index=False)
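
What index=False changes can be inspected without writing a file: when no path is given, to_csv returns the CSV text as a string. A small sketch with a made-up table:

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

with_index = df.to_csv()                 # no path given: returns the CSV text
without_index = df.to_csv(index=False)   # row labels are left out

# Reading the text back with index_col=0 restores the row labels.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

print(with_index)
print(without_index)
print(restored)
```

If the index carries information, as it does here, write it out and read it back with index_col; if it is just the default 0, 1, 2, ..., index=False avoids an extra unnamed column.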

Reading in Excel files:

In [None]:
pd.read_excel("ex1excel.xlsx")

In [None]:
pd.read_excel("ex1excel.xlsx", index_col=0)