## Libraries and Pandas



__Our goals today are to be able to:__

- Identify and import Python modules and packages (libraries)
- Investigate table data in Pandas
- Manipulate Pandas DataFrames and Series

## Libraries (Packages)

### Terminology

![mod2](img/modules2.png)



### Terminology

![packages3](img/packages3.png)

### pip & the Python Package Index

[Python Package Index](https://pypi.org/)

<img src="img/pypi_packages.png" width=600>

__You can also write your own modules__

![pipmod](img/import_modules.png)

![pippack](img/package_redo.png)
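
As a minimal sketch of what the diagrams above describe, the snippet below writes a tiny module to a temporary folder and then imports it. The module name `greetings` and its function `say_hello` are made up for illustration; in practice you would usually just save a `.py` file next to your notebook.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway folder holding a one-function module.
# `greetings` and `say_hello` are hypothetical names used only for this example.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "greetings.py").write_text(
    "def say_hello(name):\n"
    "    return f'Hello, {name}!'\n"
)

# Python searches the folders listed in sys.path when it imports a module.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

import greetings                   # import the whole module
from greetings import say_hello    # or import a single name from it

message = say_hello("Pandas")
print(message)
print(greetings.__file__)
```

A package works the same way, except that it is a folder of modules rather than a single file.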

## Pandas

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width=600>  

__Why not spreadsheets?__

[5 and Half Reasons to Ditch the Spreadsheet](https://lucidmanager.org/spreadsheets-for-data-science/)

### Installing and Using Pandas

In [6]:
import pandas as pd

In [7]:
pd.__version__

'1.0.3'

In [2]:
pandas.__version__

'1.0.3'

In [3]:
## Why pandas
pd?

Object `pd` not found.


In [4]:
## By convention, pandas is imported under the alias pd

import pandas as pd 

### Main Data Structures in Pandas: Series, DataFrame and Index

#### Series

In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [10]:
type(data)

pandas.core.series.Series

In [12]:
data[2]

0.75

In [14]:
val = data.values

In [15]:
type(val)

numpy.ndarray

In [23]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [22]:
val.values

AttributeError: 'numpy.ndarray' object has no attribute 'values'

In [17]:
ind = data.index

In [19]:
ind

RangeIndex(start=0, stop=4, step=1)

In [18]:
type(ind)

pandas.core.indexes.range.RangeIndex

In [8]:
data[3]

1.0

In [24]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [21]:
data['b']

0.5

In [25]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

[For more on Series](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.01-Introducing-Pandas-Objects.ipynb)
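
A Series behaves partly like a dictionary and partly like a NumPy array. The sketch below rebuilds the `population` Series from the cell above and tries both styles of access; the variable names other than `population` are chosen here for illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-like: look up a value by its label, test membership on labels.
texas = population['Texas']
has_florida = 'Florida' in population

# Array-like: slice by label (the end label is included) ...
first_three = population['California':'New York']

# ... and do arithmetic on all values at once.
in_millions = population / 1_000_000

print(texas, has_florida)
print(first_three)
print(in_millions)
```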

#### DataFrame

The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.
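
To make the "two-dimensional array with labels" idea concrete, here is a small sketch that wraps a NumPy array in a DataFrame. It uses three of the states from this lesson; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A plain 2-D NumPy array: rows and columns are addressed by position only.
values = np.array([[38332521, 423967],
                   [26448193, 695662],
                   [19651127, 141297]])

# The same numbers as a DataFrame, with row labels and column names attached.
frame = pd.DataFrame(values,
                     index=['California', 'Texas', 'New York'],
                     columns=['population', 'area'])

by_position = values[1, 1]               # NumPy: row 1, column 1
by_label = frame.loc['Texas', 'area']    # pandas: the same cell, by label

# Dictionary-like view: each column name maps to a Series.
area_column = frame['area']

print(by_position, by_label)
print(type(area_column))
```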

In [29]:
a = (1,2,3,5)

In [30]:
type(a)

tuple

In [31]:
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [28]:
type(population)

pandas.core.series.Series

In [32]:


area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area


states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [33]:
type(states)

pandas.core.frame.DataFrame

In [36]:
dir(population)

['California',
 'Florida',
 'Illinois',
 'T',
 'Texas',
 '_AXIS_ALIASES',
 '_AXIS_IALIASES',
 '_AXIS_LEN',
 '_AXIS_NAMES',
 '_AXIS_NUMBERS',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__long__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 ...]

In [37]:
population.columns

AttributeError: 'Series' object has no attribute 'columns'

In [38]:
states.columns

Index(['population', 'area'], dtype='object')

[Difference between Dataframe and Series](https://stackoverflow.com/questions/26047209/what-is-the-difference-between-a-pandas-series-and-a-single-column-dataframe)
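
One way to see the difference for yourself is to pull the same column out of a DataFrame in two ways. This is a small sketch using a two-row version of the `states` table; `as_series` and `as_frame` are names chosen here.

```python
import pandas as pd

states = pd.DataFrame({'population': [38332521, 26448193],
                       'area': [423967, 695662]},
                      index=['California', 'Texas'])

as_series = states['area']      # single brackets -> Series (1-D)
as_frame = states[['area']]     # list of column names -> DataFrame (2-D)

print(as_series.ndim, as_series.shape)
print(as_frame.ndim, as_frame.shape)

# A Series has a single name; a DataFrame has a collection of columns.
print(as_series.name)
print(list(as_frame.columns))
```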

### Importing and Reading Data with Pandas

In [39]:
## Let's check the current directory first
%pwd

'/Users/muratguner/Documents/lectures/dc-ds-060120/mod-1/day-3/second_session'

In [40]:
## Let's see the files in the current directory
%ls -la

total 128
drwxr-xr-x   8 muratguner  staff    256 Jun  4 15:04 ./
drwxr-xr-x   9 muratguner  staff    288 Jun  4 14:00 ../
-rw-r--r--@  1 muratguner  staff   6148 Jun  4 14:00 .DS_Store
drwxr-xr-x   4 muratguner  staff    128 Jun  4 14:08 .ipynb_checkpoints/
-rw-r--r--   1 muratguner  staff  33547 Jun  4 15:04 Libraries and Pandas-060120.ipynb
-rw-r--r--   1 muratguner  staff  17350 Jun  4 14:50 Libraries and Pandas.ipynb
drwxr-xr-x   4 muratguner  staff    128 Mar  8 21:06 data/
drwxr-xr-x  12 muratguner  staff    384 Mar  8 21:06 img/


In [43]:
import pandas as pd

## made up jobs
muj_df = pd.read_csv('data/made_up_jobs.csv')


CSV stands for comma-separated values.

- Note: free-text fields can cause problems in a CSV file, for example when the text itself contains commas.

- Despite its limitations, CSV is a very common format.
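
The sketch below illustrates the text problem with a made-up CSV string (not the course data file). The field that contains a comma is wrapped in double quotes, which is how `read_csv` knows to keep it in one column.

```python
import io
import pandas as pd

# Hypothetical CSV text. The last row has a comma inside the Job field,
# so the field is wrapped in double quotes to keep it in one column.
csv_text = (
    'ID,Name,Job\n'
    '0,Bob Bobberty,Underwater Basket Weaver\n'
    '1,Susan Smells,"Salad Spinner, Senior"\n'
)

jobs = pd.read_csv(io.StringIO(csv_text))
print(jobs)

# Writing back out: pandas adds the quotes again where they are needed.
round_trip = jobs.to_csv(index=False)
print(round_trip)
```

Without the quotes, that row would have four fields instead of three, and pandas would either raise a parsing error or shift the columns.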

In [44]:
muj_df

   ID            Name                       Job  Years Employed
0   0    Bob Bobberty  Underwater Basket Weaver              13
1   1    Susan Smells             Salad Spinner               5
2   2   Alex Lastname      Productivity Manager               2
3   3         Rudy P.                Being cool              55
4   4         Rudy G.  Being compared to Rudy P              50
5   5  Sir Wellington            Cheese Stacker              10


We can read many other file types with pandas, for example with `read_excel`, `read_html`, etc.
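
As a rough sketch of the other readers, the example below uses `read_json` and a tab-separated `read_csv`, since both work on in-memory text with no extra setup. The sample text is invented for this example. `read_excel` and `read_html` follow the same pattern but typically need additional packages installed.

```python
import io
import pandas as pd

# Hypothetical JSON text: a list of records, one per row.
json_text = '[{"ID": 0, "Name": "Bob Bobberty"}, {"ID": 1, "Name": "Susan Smells"}]'
from_json = pd.read_json(io.StringIO(json_text), orient='records')

# Tab-separated text: read_csv handles it once the separator is given.
tsv_text = 'ID\tName\n0\tBob Bobberty\n1\tSusan Smells\n'
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(from_json)
print(from_tsv)
```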

In [16]:
type(pd)

module

In [17]:
## Let's take a look at the attributes of the pandas module
dir(pd)

['BooleanDtype',
 'Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NA',
 'NaT',
 'NamedAgg',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseDtype',
 'StringDtype',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__getattr__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_config',
 '_hashtable',
 '_is_numpy_dev',
 '_lib',
 '_libs',
 '_np_version_under1p14',
 '_np_version_under1p15',
 '_np_version_under1p16',
 '_np_version_under1p17',
 ...]

__some methods that will be useful__

- head, tail

- describe

- info

- loc vs iloc?

- values

- renaming columns

- dropping columns
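
Several items in this list can be tried in one short sketch. It recreates the small jobs table shown above so that it runs on its own; the new column name `years_employed` is just an example.

```python
import pandas as pd

# Recreate the small table shown above so the example runs on its own.
muj_df = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4, 5],
    'Name': ['Bob Bobberty', 'Susan Smells', 'Alex Lastname',
             'Rudy P.', 'Rudy G.', 'Sir Wellington'],
    'Job': ['Underwater Basket Weaver', 'Salad Spinner', 'Productivity Manager',
            'Being cool', 'Being compared to Rudy P', 'Cheese Stacker'],
    'Years Employed': [13, 5, 2, 55, 50, 10],
})

# describe: summary statistics for a numeric column.
years_summary = muj_df['Years Employed'].describe()

# values: the underlying data as a NumPy array.
raw = muj_df.values

# Renaming columns: returns a new DataFrame, the original is unchanged.
renamed = muj_df.rename(columns={'Years Employed': 'years_employed'})

# Dropping columns: also returns a new DataFrame.
without_id = muj_df.drop(columns=['ID'])

print(years_summary)
print(renamed.columns.to_list())
print(without_id.columns.to_list())
```

Both `rename` and `drop` leave `muj_df` itself untouched unless you assign the result back to it.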



In [45]:
# shape
## the first one is the number of rows
## the second is the number of columns

muj_df.shape

(6, 4)

In [48]:
#head

muj_df.head(3)

   ID           Name                       Job  Years Employed
0   0   Bob Bobberty  Underwater Basket Weaver              13
1   1   Susan Smells             Salad Spinner               5
2   2  Alex Lastname      Productivity Manager               2


In [49]:
#tail

muj_df.tail(3)

   ID            Name                       Job  Years Employed
3   3         Rudy P.                Being cool              55
4   4         Rudy G.  Being compared to Rudy P              50
5   5  Sir Wellington            Cheese Stacker              10


In [52]:
n = muj_df.Name

In [53]:
type(n)

pandas.core.series.Series

In [55]:
type(muj_df['Name'])

pandas.core.series.Series

In [56]:
type(muj_df[['Name']])

pandas.core.frame.DataFrame

In [58]:
list(muj_df.columns)

['ID', 'Name', 'Job', 'Years Employed']

In [61]:
muj_df.columns.to_list()

['ID', 'Name', 'Job', 'Years Employed']

In [64]:
muj_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   ID              6 non-null      int64 
 1   Name            6 non-null      object
 2   Job             6 non-null      object
 3   Years Employed  6 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 320.0+ bytes


In [66]:
muj_df.Name.describe()


count                 6
unique                6
top       Alex Lastname
freq                  1
Name: Name, dtype: object

In [67]:
muj_df

   ID            Name                       Job  Years Employed
0   0    Bob Bobberty  Underwater Basket Weaver              13
1   1    Susan Smells             Salad Spinner               5
2   2   Alex Lastname      Productivity Manager               2
3   3         Rudy P.                Being cool              55
4   4         Rudy G.  Being compared to Rudy P              50
5   5  Sir Wellington            Cheese Stacker              10


In [70]:
muj_df.iloc[2:5]

   ID           Name                       Job  Years Employed
2   2  Alex Lastname      Productivity Manager               2
3   3        Rudy P.                Being cool              55
4   4        Rudy G.  Being compared to Rudy P              50


In [72]:
muj_df.loc[2:4]

   ID           Name                       Job  Years Employed
2   2  Alex Lastname      Productivity Manager               2
3   3        Rudy P.                Being cool              55
4   4        Rudy G.  Being compared to Rudy P              50
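
The two slices above return the same rows even though one uses `2:5` and the other `2:4`. The reason is that `iloc` counts positions and excludes the end, while `loc` matches labels and includes the end. With the default index, labels and positions happen to be the same numbers. A small sketch with different labels (made up for this example) makes the distinction easier to see:

```python
import pandas as pd

years = pd.Series([13, 5, 2, 55, 50, 10],
                  index=[10, 20, 30, 40, 50, 60])  # hypothetical labels

# iloc counts positions: rows at positions 2, 3 and 4 (end excluded).
by_position = years.iloc[2:5]

# loc uses labels: rows labelled 30 through 50 (end included).
by_label = years.loc[30:50]

# A label that works with loc is not necessarily a valid position.
first_by_label = years.loc[10]     # the row labelled 10
first_by_position = years.iloc[0]  # the row at position 0

print(by_position)
print(by_label)
print(first_by_label, first_by_position)
```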


In [73]:
# Dictionary with list object in values
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york']
}


students_df = pd.DataFrame(student_dict)

students_df.head()

       name  age      city
0  Samantha   35   Houston
1      Alex   17   Seattle
2     Dante   26  New york


In [75]:
students_df.index.to_list()

[0, 1, 2]

In [79]:
students_df.set_index('name', inplace = True)

In [80]:
students_df

          age      city
name
Samantha   35   Houston
Alex       17   Seattle
Dante      26  New york
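
The cell above uses `inplace=True`, which changes `students_df` directly. A possible alternative, sketched below, is to keep the original and work with the returned copy; `by_name` and `restored` are names chosen for this example.

```python
import pandas as pd

students_df = pd.DataFrame({
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york'],
})

# Without inplace=True, set_index returns a new DataFrame and leaves
# the original untouched.
by_name = students_df.set_index('name')

# reset_index moves the index back into an ordinary column.
restored = by_name.reset_index()

print(by_name.index.to_list())
print(students_df.columns.to_list())
print(restored.columns.to_list())
```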


In [82]:
students_df.loc['Samantha' : 'Dante']

          age      city
name
Samantha   35   Houston
Alex       17   Seattle
Dante      26  New york


In [83]:
students_df.iloc[1 : 2]

      age     city
name
Alex   17  Seattle
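
Both `loc` and `iloc` also accept a column selector after the row selector. The sketch below rebuilds the name-indexed `students_df` from above and picks out a single cell each way; the other variable names are illustrative.

```python
import pandas as pd

students_df = pd.DataFrame({
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york'],
}, index=pd.Index(['Samantha', 'Alex', 'Dante'], name='name'))

# Row label and column label.
alex_city = students_df.loc['Alex', 'city']

# Same cell by position: row 1, column 1.
alex_city_by_position = students_df.iloc[1, 1]

# Label slices include the end label; position slices exclude the end.
label_slice = students_df.loc['Samantha':'Alex']   # 2 rows
position_slice = students_df.iloc[0:1]             # 1 row

print(alex_city, alex_city_by_position)
print(len(label_slice), len(position_slice))
```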
