# Lab2 Numpy and Pandas

## 1 Numpy

Numpy (short for "NUMerical PYthon") is a Python package for working with arrays.

Numpy is documented [here](https://numpy.org/doc/stable/).

In [4]:
import numpy as np # standard naming convention for numpy

### 1.1 Creating Numpy Arrays

In [5]:
# 1-D array
x = np.array([1, 2, 3])
x

array([1, 2, 3])

In [6]:
type(x)

numpy.ndarray

We can create arrays with a defined data type.

In [4]:
y = np.array([1, 2, 3], dtype = "float64")
y

array([1., 2., 3.])

Numpy arrays have a `.dtype` attribute that returns the data type of the array's elements.

In [5]:
y.dtype

dtype('float64')

The entries in a Numpy array must have the **same** data type.

In [6]:
np.array([1, True, 2.31]) # recast all entries to float

array([1.  , 1.  , 2.31])

In [7]:
np.array([1, 2.0, "horse"]) # recast all entries to string

array(['1', '2.0', 'horse'], dtype='<U32')
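The conversions above can be explored a little further. The following is a minimal sketch (the variable names are illustrative only) showing that the inferred type can be overridden with `dtype`, and that `.astype()` returns a converted copy.

```python
import numpy as np

# Mixing int, bool and float promotes every entry to float64.
mixed_numeric = np.array([1, True, 2.31])
print(mixed_numeric.dtype)  # float64

# An explicit dtype overrides the inferred one; the values are converted.
as_int = np.array([1.9, 2.1], dtype="int64")
print(as_int)  # [1 2]  (the fractional part is truncated)

# .astype() returns a converted copy and leaves the original array unchanged.
as_float = as_int.astype("float64")
print(as_float.dtype, as_int.dtype)  # float64 int64
```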

We can create multi-dimensional arrays from nested lists.

In [8]:
# 2-D array
z = np.array([[1, 2, 3], [4, 5, 6]])
z

array([[1, 2, 3],
       [4, 5, 6]])

Numpy also provides other array creation functions. See the [documentation](https://numpy.org/doc/stable/user/basics.creation.html) for a full list.

In [9]:
np.arange(2, 3, 0.1)

array([2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9])

In [10]:
np.diag([1, 2, 3])

array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])
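A few more creation functions, as a small sketch. The selection here is only a sample of what the linked documentation lists. `np.linspace()` is often a safer choice than `np.arange()` with a fractional step, because the number of points is stated explicitly.

```python
import numpy as np

zeros = np.zeros((2, 3))          # 2x3 array filled with 0.0
ones = np.ones(4, dtype="int64")  # 1-D array of four integer ones
grid = np.linspace(0, 1, 5)       # 5 evenly spaced points, endpoint included
identity = np.eye(3)              # 3x3 identity matrix

print(zeros.shape)     # (2, 3)
print(ones)            # [1 1 1 1]
print(grid)            # [0.   0.25 0.5  0.75 1.  ]
print(identity.sum())  # 3.0
```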

### 1.2 Inspecting Numpy Arrays

The array attributes `.shape`, `.ndim`, and `.size` describe the structure of the array.

In [11]:
z

array([[1, 2, 3],
       [4, 5, 6]])

In [12]:
z.shape # shape of an array in a tuple format

(2, 3)

In [13]:
z.ndim # number of array dimensions

2

In [14]:
z.size # number of elements in the array

6

We can use the `.reshape()` method to change the shape of an array. It returns a reshaped array and leaves `z` itself unchanged.

In [15]:
z.reshape(3, 2)

array([[1, 2],
       [3, 4],
       [5, 6]])
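As a brief sketch of two related details: one dimension can be given as `-1` so that Numpy infers it, and the requested shape must hold exactly `.size` elements.

```python
import numpy as np

z = np.array([[1, 2, 3], [4, 5, 6]])

# -1 asks Numpy to infer that dimension from the total size (6 / 3 = 2).
inferred = z.reshape(3, -1)
print(inferred.shape)  # (3, 2)

# A single -1 flattens the array to 1-D.
flat = z.reshape(-1)
print(flat)  # [1 2 3 4 5 6]

# A shape that does not hold exactly z.size elements raises ValueError.
try:
    z.reshape(4, 2)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```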

### 1.3 Indexing and Slicing

We can access array elements by referring to index numbers.

In [15]:
a = np.arange(1, 10).reshape(3,3)
a

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [16]:
a[:, 0] # first column

array([1, 4, 7])

In [17]:
a[1, :] # second row

array([4, 5, 6])

We can filter array elements using a boolean filter array.

In [18]:
filter_arr = a % 2 == 1
filter_arr

array([[ True, False,  True],
       [False,  True, False],
       [ True, False,  True]])

In [19]:
a[filter_arr] # return only odd elements

array([1, 3, 5, 7, 9])
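Filter conditions can also be combined. The sketch below assumes the same 3×3 array as above. Note that element-wise logic uses `&`, `|` and `~` (not `and`, `or`, `not`), with parentheses around each condition.

```python
import numpy as np

a = np.arange(1, 10).reshape(3, 3)

# Odd AND greater than 3.
odd_and_large = a[(a % 2 == 1) & (a > 3)]
print(odd_and_large)  # [5 7 9]

# Less than 3 OR greater than 7.
small_or_large = a[(a < 3) | (a > 7)]
print(small_or_large)  # [1 2 8 9]

# A mask can also select the entries to assign to.
# We work on a copy so that `a` stays intact.
clipped = a.copy()
clipped[clipped > 5] = 5
print(clipped)
```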

We can also select subsections of arrays using slicing.

In [20]:
a[:, 1:] # all rows, second through last column

array([[2, 3],
       [5, 6],
       [8, 9]])

Numpy arrays use reference semantics, just like `list` objects. Note, however, that slicing an array creates a **view** of the original array, whereas slicing a list creates a **copy** of the selected elements.

In [21]:
b = a[0:2, 0:2] # slicing creates view
b

array([[1, 2],
       [4, 5]])

In [22]:
b[0, 0] = 100
a # a is affected

array([[100,   2,   3],
       [  4,   5,   6],
       [  7,   8,   9]])

In [23]:
c = a[0:2, 0:2].copy() # create a copy
c

array([[100,   2],
       [  4,   5]])

In [24]:
c[0, 0] = 1
a # a is not affected

array([[100,   2,   3],
       [  4,   5,   6],
       [  7,   8,   9]])

We can use the `.base` attribute to check whether an array is a copy or a view.

- For a copy, `.base` is `None`.

- For a view, `.base` is the array that owns the underlying data.

A copy is a new array that owns its data: changes made to the copy do not affect the original array, and changes made to the original do not affect the copy. A view does not own its data; it shares memory with the original array, so a change made through either one is visible in the other. In the example below, `a` was itself created by reshaping the output of `np.arange()`, so `b.base` shows that 1-D array rather than the 3×3 array `a`.

In [25]:
b.base # b is a view

array([100,   2,   3,   4,   5,   6,   7,   8,   9])

In [26]:
print(c.base) # c is a copy

None
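To make the list/array contrast concrete, here is a minimal side-by-side sketch. It uses `np.shares_memory()` as an alternative check to `.base`. The variable names are illustrative only.

```python
import numpy as np

# Slicing a list copies the selected elements.
lst = [1, 2, 3, 4]
lst_slice = lst[0:2]
lst_slice[0] = 100
print(lst)  # [1, 2, 3, 4]  (unchanged)

# Slicing an array returns a view onto the same memory.
arr = np.array([1, 2, 3, 4])
arr_slice = arr[0:2]
arr_slice[0] = 100
print(arr)  # [100   2   3   4]

# np.shares_memory reports whether two arrays overlap in memory.
view_shares = np.shares_memory(arr, arr_slice)        # True
copy_shares = np.shares_memory(arr, arr[0:2].copy())  # False

# Boolean filtering, unlike slicing, produces a new array.
picked = arr[arr > 2]
picked_shares = np.shares_memory(arr, picked)         # False
print(view_shares, copy_shares, picked_shares)
```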


## 2 Pandas

Pandas is a Python package that provides tools for manipulating tabular data. The name "pandas" is short for "PANel DAta".

Pandas is documented [here](https://pandas.pydata.org/pandas-docs/stable/).

In [12]:
import pandas as pd # standard naming convention for pandas

### 2.1 Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, floats, strings, etc.).

Every Series combines an index and corresponding data values.

In [33]:
# create series from list
x = pd.Series(["a", "b", "c", "d", "e"]) # index is 0, 1, 2, ... by default
x

0    a
1    b
2    c
3    d
4    e
dtype: object

In [34]:
type(x)

pandas.core.series.Series

In [35]:
y = pd.Series(["a", "b", "c", "d", "e"], index = range(1, 6)) # set different index
y

1    a
2    b
3    c
4    d
5    e
dtype: object

In [36]:
y.index

RangeIndex(start=1, stop=6, step=1)

In [37]:
y.values

array(['a', 'b', 'c', 'd', 'e'], dtype=object)

Pandas supports vectorized operations, but elements are **automatically aligned by index** rather than by position. Labels that appear in only one operand produce `NaN`.

In [38]:
x + y

0    NaN
1     ba
2     cb
3     dc
4     ed
5    NaN
dtype: object
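The same alignment applies to numeric Series. The sketch below uses two small hypothetical Series and shows two ways to deal with labels that do not match.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one Series ("a" and "d") produce NaN.
total = s1 + s2
print(total)

# .add() with fill_value treats a missing entry as 0 instead.
filled = s1.add(s2, fill_value=0)
print(filled)

# To add by position, drop the labels first.
by_position = s1.to_numpy() + s2.to_numpy()
print(by_position)  # [11 22 33]
```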

### 2.2 DataFrame

A DataFrame is a 2-D, table-like data structure containing an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, etc.).

It represents tabular data as a collection of Series.

In [39]:
df = pd.DataFrame({"x": np.random.random(4), "y": ["a", "b", "c", "d"], "z": range(10, 17, 2)})
df

|   | x | y | z |
|---|---|---|---|
| 0 | 0.788167 | a | 10 |
| 1 | 0.036173 | b | 12 |
| 2 | 0.399713 | c | 14 |
| 3 | 0.822182 | d | 16 |


In [68]:
type(df)

pandas.core.frame.DataFrame
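To illustrate the "collection of Series" view, here is a small sketch with a hypothetical two-row frame. Each column is a Series that shares the frame's index and has its own data type.

```python
import pandas as pd

frame = pd.DataFrame({"x": [0.5, 1.5], "y": ["a", "b"], "z": [10, 12]})

col = frame["z"]
print(type(col).__name__)             # Series
print(col.index.equals(frame.index))  # True: the column shares the row index

print(frame.dtypes)  # one data type per column
print(frame.shape)   # (2, 3): rows, columns
```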

### 2.3 Indexing and Slicing

Pandas supports different ways of [indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html). 

In [40]:
z = pd.Series(range(1, 5), index = ["a", "b", "c", "d"])
z

a    1
b    2
c    3
d    4
dtype: int64

In [41]:
df

|   | x | y | z |
|---|---|---|---|
| 0 | 0.788167 | a | 10 |
| 1 | 0.036173 | b | 12 |
| 2 | 0.399713 | c | 14 |
| 3 | 0.822182 | d | 16 |


`[ ]` indexing

In [42]:
z[1] # second entry of z (positional fallback; deprecated in recent pandas versions, prefer z.iloc[1])

2

In [43]:
z["a"] # entry with label "a" of z

1

In [44]:
df["x"] # column with name "x" of df

0    0.788167
1    0.036173
2    0.399713
3    0.822182
Name: x, dtype: float64

In [45]:
df[0:2] # first and second rows of df

|   | x | y | z |
|---|---|---|---|
| 0 | 0.788167 | a | 10 |
| 1 | 0.036173 | b | 12 |


`.iloc[]` is primarily integer-position based and uses the same slicing syntax as Numpy arrays, so the **endpoint** of a slice is **excluded**.

In [46]:
z.iloc[1] # second entry of z

2

In [47]:
# endpoint is excluded
z.iloc[:3] # first through third entries of z

a    1
b    2
c    3
dtype: int64

In [48]:
df.iloc[0, 0] # top-left entry of df

0.7881672596450096

`.loc[]` is primarily label based. For slicing with labels, the **endpoint** is **inclusive**.

In [49]:
z.loc["b"] # entry with label "b" from z

2

In [50]:
# endpoint is included
z.loc["a":"c"] # entries with labels "a" through "c" from z

a    1
b    2
c    3
dtype: int64

In [51]:
df.loc[1, "y"] # entry corresponding to row with label 1 and column with name "y" from df

'b'
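The difference between `.loc[]` and `.iloc[]` matters most when the index labels are integers that do not match the positions. A minimal sketch with a hypothetical Series:

```python
import pandas as pd

s = pd.Series(["p", "q", "r", "s"], index=[10, 20, 30, 40])

print(s.loc[20])   # q  (entry with label 20)
print(s.iloc[2])   # r  (entry at position 2, i.e. the third entry)

# Label slices include both endpoints; position slices exclude the stop.
label_slice = s.loc[10:30].tolist()
position_slice = s.iloc[0:2].tolist()
print(label_slice)     # ['p', 'q', 'r']
print(position_slice)  # ['p', 'q']

# .loc[] also accepts a boolean mask for rows together with a column label.
frame = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
selected = frame.loc[frame["x"] > 1, "y"].tolist()
print(selected)  # ['b', 'c']
```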

### 2.4 Removing Entries / Rows / Columns

Use `.drop()` to remove entries from a Series by specifying index labels, and to remove rows or columns from a DataFrame by specifying index labels or column names.

In [52]:
z

a    1
b    2
c    3
d    4
dtype: int64

In [53]:
z.drop("b") # remove entry with label "b"

a    1
c    3
d    4
dtype: int64

In [54]:
df

|   | x | y | z |
|---|---|---|---|
| 0 | 0.788167 | a | 10 |
| 1 | 0.036173 | b | 12 |
| 2 | 0.399713 | c | 14 |
| 3 | 0.822182 | d | 16 |


In [55]:
df.drop(index = [0, 1]) # remove rows with index labels 0 and 1

|   | x | y | z |
|---|---|---|---|
| 2 | 0.399713 | c | 14 |
| 3 | 0.822182 | d | 16 |


In [56]:
df.drop(columns = ["x"]) # remove column with name "x"

|   | y | z |
|---|---|---|
| 0 | a | 10 |
| 1 | b | 12 |
| 2 | c | 14 |
| 3 | d | 16 |
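Note that `.drop()` returns a new object and leaves the original unchanged, which is why `df` still has all of its rows and columns in the cells above. A short sketch with a hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# The result is a new DataFrame; `frame` keeps both columns.
smaller = frame.drop(columns=["x"])
print(frame.columns.tolist())    # ['x', 'y']
print(smaller.columns.tolist())  # ['y']

# Assign the result back to keep the change.
frame = frame.drop(index=[0])
print(frame.index.tolist())  # [1, 2]

# errors="ignore" skips labels that do not exist instead of raising KeyError.
same = frame.drop(columns=["not_a_column"], errors="ignore")
print(same.shape)  # (2, 2)
```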


### 2.5 Reading Data

Pandas provides functions for reading (and writing) a variety of common formats of data files. See the [documentation](https://pandas.pydata.org/docs/user_guide/io.html) for a full list.

In [13]:
dogs = pd.read_csv("/Users/sophiasun/Downloads/dogs_full.csv")
dogs.head() # display the first 5 rows of the data set

|   | breed | group | datadog | popularity_all | popularity | lifetime_cost | intelligence_rank | longevity | ailments | price | food_cost | grooming | kids | megarank_kids | megarank | size | weight | height |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Border Collie | herding | 3.64 | 45 | 39.0 | 20143.0 | 1.0 | 12.52 | 2.0 | 623.0 | 324.0 | weekly | low | 1.0 | 29.0 | medium | NaN | 20.0 |
| 1 | Border Terrier | terrier | 3.61 | 80 | 61.0 | 22638.0 | 30.0 | 14.00 | 0.0 | 833.0 | 324.0 | weekly | high | 2.0 | 1.0 | small | 13.5 | NaN |
| 2 | Brittany | sporting | 3.54 | 30 | 30.0 | 22589.0 | 19.0 | 12.92 | 0.0 | 618.0 | 466.0 | weekly | medium | 3.0 | 11.0 | medium | 35.0 | 19.0 |
| 3 | Cairn Terrier | terrier | 3.53 | 59 | 48.0 | 21992.0 | 35.0 | 13.84 | 2.0 | 435.0 | 324.0 | weekly | high | 4.0 | 2.0 | small | 14.0 | 10.0 |
| 4 | Welsh Springer Spaniel | sporting | 3.34 | 130 | 81.0 | 20224.0 | 31.0 | 12.49 | 1.0 | 750.0 | 324.0 | weekly | high | 5.0 | 4.0 | medium | NaN | 18.0 |
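Because the CSV file above lives at a local path, the following sketch reads from an in-memory string instead, so it can be run anywhere. The three rows are a hypothetical excerpt that reuses a few values from the table above. `usecols` and `index_col` are standard `read_csv` parameters.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = """breed,group,weight
Border Collie,herding,
Brittany,sporting,35.0
Cairn Terrier,terrier,14.0
"""

pets = pd.read_csv(io.StringIO(csv_text))
print(pets.shape)         # (3, 3)
print(pets.isna().sum())  # the empty weight field is read as a missing value

# usecols limits the columns read; index_col uses a column as the row labels.
by_breed = pd.read_csv(
    io.StringIO(csv_text), usecols=["breed", "weight"], index_col="breed"
)
print(by_breed.loc["Brittany", "weight"])  # 35.0

# Writing works the same way in reverse; without a path, to_csv returns text.
out = by_breed.to_csv()
print(out)
```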


### 2.6 Aggregation

Pandas provides several methods for aggregating data, such as `.sum()`, `.mean()`, `.median()`, `.min()`, and `.max()`. By default they skip missing values and aggregate along `axis = 0`, that is, down the rows, producing one result per column.

In [58]:
z

a    1
b    2
c    3
d    4
dtype: int64

In [59]:
z.mean() # mean of values in z

2.5

In [60]:
df

|   | x | y | z |
|---|---|---|---|
| 0 | 0.788167 | a | 10 |
| 1 | 0.036173 | b | 12 |
| 2 | 0.399713 | c | 14 |
| 3 | 0.822182 | d | 16 |


In [61]:
df.mean(numeric_only = True) # numerical columns' means 

x     0.511559
z    13.000000
dtype: float64

In [62]:
df.max(numeric_only = True, axis = 1) # maximum of numerical entries in each row

0    10.0
1    12.0
2    14.0
3    16.0
dtype: float64

In [63]:
df.describe() # descriptive statistics for numerical columns

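|   | x | z |
|---|---|---|
| count | 4.000000 | 4.000000 |
| mean | 0.511559 | 13.000000 |
| std | 0.370360 | 2.581989 |
| min | 0.036173 | 10.000000 |
| 25% | 0.308828 | 11.500000 |
| 50% | 0.593940 | 13.000000 |
| 75% | 0.796671 | 14.500000 |
| max | 0.822182 | 16.000000 |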
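The handling of missing values and the `axis` argument can be seen in a small sketch. The frame `scores` is hypothetical and contains one `NaN`.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Default: the NaN in column "a" is skipped, so its mean is (1 + 3) / 2.
col_means = scores.mean()
print(col_means)

# skipna=False propagates the missing value instead.
strict_means = scores.mean(skipna=False)
print(strict_means)

# axis=1 aggregates across the columns, giving one result per row.
row_sums = scores.sum(axis=1)
print(row_sums)

# .agg() computes several aggregates at once.
extremes = scores.agg(["min", "max"])
print(extremes)
```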


### 2.7 Applying Functions

We can also use the `.apply()` method to apply our own aggregation functions to columns (`axis = 0`) or to rows (`axis = 1`).

In [8]:
df

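|   | x | y | z |
|---|---|---|---|
| 0 | 0.788167 | a | 10 |
| 1 | 0.036173 | b | 12 |
| 2 | 0.399713 | c | 14 |
| 3 | 0.822182 | d | 16 |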

In [65]:
df.apply(lambda x: pd.Series([min(x), max(x)], index = ["min", "max"]), axis = 0)

|   | x | y | z |
|---|---|---|---|
| min | 0.036173 | a | 10 |
| max | 0.822182 | d | 16 |
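For comparison, here is a sketch of `.apply()` along both axes on a small hypothetical numeric frame. When a built-in vectorized operation exists, it is usually the simpler choice than a row-wise `.apply()`.

```python
import pandas as pd

frame = pd.DataFrame({"x": [1.0, 4.0], "z": [10, 12]})

# axis=0: the function receives each column as a Series.
col_range = frame.apply(lambda col: col.max() - col.min(), axis=0)
print(col_range)

# axis=1: the function receives each row as a Series.
row_total = frame.apply(lambda row: row["x"] + row["z"], axis=1)
print(row_total.tolist())  # [11.0, 16.0]

# The same row totals using vectorized column arithmetic.
vectorized_total = frame["x"] + frame["z"]
print(vectorized_total.tolist())  # [11.0, 16.0]
```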


### 2.8 Grouping

Use the `.groupby()` method to split the data into groups before computing aggregate statistics for each group.

In [7]:
dogs.head() # display the first 5 rows

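|   | breed | group | datadog | popularity_all | popularity | lifetime_cost | intelligence_rank | longevity | ailments | price | food_cost | grooming | kids | megarank_kids | megarank | size | weight | height |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Border Collie | herding | 3.64 | 45 | 39.0 | 20143.0 | 1.0 | 12.52 | 2.0 | 623.0 | 324.0 | weekly | low | 1.0 | 29.0 | medium | NaN | 20.0 |
| 1 | Border Terrier | terrier | 3.61 | 80 | 61.0 | 22638.0 | 30.0 | 14.00 | 0.0 | 833.0 | 324.0 | weekly | high | 2.0 | 1.0 | small | 13.5 | NaN |
| 2 | Brittany | sporting | 3.54 | 30 | 30.0 | 22589.0 | 19.0 | 12.92 | 0.0 | 618.0 | 466.0 | weekly | medium | 3.0 | 11.0 | medium | 35.0 | 19.0 |
| 3 | Cairn Terrier | terrier | 3.53 | 59 | 48.0 | 21992.0 | 35.0 | 13.84 | 2.0 | 435.0 | 324.0 | weekly | high | 4.0 | 2.0 | small | 14.0 | 10.0 |
| 4 | Welsh Springer Spaniel | sporting | 3.34 | 130 | 81.0 | 20224.0 | 31.0 | 12.49 | 1.0 | 750.0 | 324.0 | weekly | high | 5.0 | 4.0 | medium | NaN | 18.0 |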

In [67]:
dogs.groupby("group").mean(numeric_only = True)

| group | datadog | popularity_all | popularity | lifetime_cost | intelligence_rank | longevity | ailments | price | food_cost | megarank_kids | megarank | weight | height |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| herding | 2.732000 | 99.880000 | 43.500000 | 20691.818182 | 21.812500 | 11.728824 | 2.235294 | 814.941176 | 490.900000 | 40.300000 | 42.600000 | 36.666667 | 19.730000 |
| hound | 2.373077 | 104.769231 | 52.692308 | 19365.769231 | 54.904762 | 10.793529 | 0.833333 | 746.571429 | 514.538462 | 54.769231 | 56.153846 | 63.833333 | 22.543478 |
| non-sporting | 2.488000 | 82.210526 | 38.600000 | 19315.800000 | 46.714286 | 10.976000 | 1.352941 | 930.500000 | 409.200000 | 46.300000 | 42.800000 | 27.928571 | 14.984375 |
| sporting | 2.976000 | 87.428571 | 46.066667 | 20299.312500 | 27.782609 | 10.895600 | 1.040000 | 760.125000 | 510.866667 | 27.466667 | 17.266667 | 51.966667 | 21.276786 |
| terrier | 2.787500 | 100.250000 | 58.416667 | 20504.333333 | 44.750000 | 11.480000 | 0.653846 | 905.760000 | 389.916667 | 37.166667 | 39.583333 | 23.413043 | 13.780000 |
| toy | 2.805385 | 54.052632 | 36.769231 | 19506.076923 | 48.647059 | 11.672941 | 1.000000 | 686.894737 | 343.230769 | 35.076923 | 49.000000 | 9.818182 | 10.533333 |
| working | 2.065000 | 71.111111 | 32.285714 | 19164.687500 | 41.529412 | 9.465909 | 1.772727 | 1235.708333 | 721.500000 | 66.571429 | 62.000000 | 105.000000 | 25.388889 |
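The full data set is not needed to experiment with grouping. The sketch below builds a small hypothetical frame, `dogs_small`, from the `group`, `price` and `longevity` values of the five rows shown by `dogs.head()`. It then aggregates a single column, and finally uses `.agg()` to compute differently named statistics per group.

```python
import pandas as pd

# Hypothetical excerpt: three columns from the first five rows of `dogs`.
dogs_small = pd.DataFrame({
    "group": ["herding", "terrier", "sporting", "terrier", "sporting"],
    "price": [623.0, 833.0, 618.0, 435.0, 750.0],
    "longevity": [12.52, 14.0, 12.92, 13.84, 12.49],
})

# Select one column after grouping to aggregate only that column.
mean_price = dogs_small.groupby("group")["price"].mean()
print(mean_price)

# Named aggregation: output_name=(input_column, function).
summary = dogs_small.groupby("group").agg(
    n=("price", "count"),
    max_longevity=("longevity", "max"),
)
print(summary)
```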
