## Libraries and Pandas



__Our goals today are to be able to:__

- Identify and import Python modules and packages (libraries)
- Investigate table data in Pandas
- Manipulate Pandas DataFrames and Series

## Libraries (Packages)

### Terminology

![mod2](img/modules2.png)



### Terminology

![packages3](img/packages3.png)

### pip & the Python Package Index

[Python Package Index](https://pypi.org/)

<img src="img/pypi_packages.png" width=600>

__You can also write your own modules:__
![pipmod](img/import_modules.png)

![pippack](img/package_redo.png)
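As a minimal sketch of the idea in the slides above: any `.py` file that sits in a directory on `sys.path` can be imported as a module. The module name `greetings` and the function `hello` are hypothetical, made up for this illustration, and the file is written to a temporary directory so that the example is self-contained.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway directory that holds a one-function module.
# `greetings` and `hello` are hypothetical names used only for illustration.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "greetings.py").write_text(
    'def hello(name):\n'
    '    return f"Hello, {name}!"\n'
)

# Python looks through the directories listed in sys.path to resolve imports.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

import greetings  # the file greetings.py is now importable as a module

message = greetings.hello("pandas")
print(message)
```

In day-to-day work the module file would usually live next to the notebook or script that imports it, so no change to `sys.path` would be needed.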

## Pandas

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width=600>  

__Why not spreadsheets?__

[5 and Half Reasons to Ditch the Spreadsheet](https://lucidmanager.org/spreadsheets-for-data-science/)

### Installing and Using Pandas

In [1]:
import pandas 

In [2]:
pandas.__version__

'1.0.3'

In [3]:
## `pd` does not exist yet: so far the library is only bound to the name `pandas`
pd?

Object `pd` not found.


In [4]:
## By convention, pandas is imported under the alias pd

import pandas as pd 

### Main Data Structures in Pandas: Series, DataFrame and Index

#### Series

In [5]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [6]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [7]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [8]:
data[3]

1.0

In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [10]:
data['b']

0.5

In [11]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

[For more on Series](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.01-Introducing-Pandas-Objects.ipynb)
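The cells above build a Series from a list and from a dictionary. As a small sketch of why the labelled index is useful, the example below reuses the `population` data and shows that a Series can be used like a dictionary (lookup by key) and like an array (slicing, arithmetic, boolean masks). The variable names other than `population` are introduced here for illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-style access: look a value up by its label.
texas = population['Texas']
has_florida = 'Florida' in population  # membership is checked against the index

# Array-style access: slice by label (the end label is included).
first_three = population['California':'New York']

# Element-wise arithmetic and boolean masks keep the labels attached.
in_millions = population / 1_000_000
large_states = population[population > 20_000_000]

print(first_three)
print(large_states)
```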

#### DataFrame

The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [12]:


area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area


states = pd.DataFrame({'population': population,
                       'area': area})
states

| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |


[Difference between DataFrame and Series](https://stackoverflow.com/questions/26047209/what-is-the-difference-between-a-pandas-series-and-a-single-column-dataframe)
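To make the Series/DataFrame distinction concrete, here is a short sketch that rebuilds the `states` table from above and selects the same column in two ways. The names `area_series` and `area_frame` are introduced only for this illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})

# A DataFrame has both a row index and column labels.
print(states.index)
print(states.columns)

# Single brackets with one column name return a one-dimensional Series.
area_series = states['area']

# Double brackets (a list of names) return a two-dimensional DataFrame,
# even when the list holds a single column.
area_frame = states[['area']]

print(type(area_series), area_series.shape)
print(type(area_frame), area_frame.shape)
```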

### Importing and Reading Data with Pandas

In [13]:
## Let's check the current directory first
%pwd

'/Users/muratguner/Documents/lectures/dc-ds-060120/mod-1/day-3/second_session'

In [14]:
## Let's see the files in the current directory
%ls -la

total 56
drwxr-xr-x   7 muratguner  staff    224 Jun  4 14:01 ./
drwxr-xr-x   9 muratguner  staff    288 Jun  4 14:00 ../
-rw-r--r--@  1 muratguner  staff   6148 Jun  4 14:00 .DS_Store
drwxr-xr-x   2 muratguner  staff     64 Jun  4 14:01 .ipynb_checkpoints/
-rw-r--r--   1 muratguner  staff  17621 Jun  4 13:59 Libraries and Pandas.ipynb
drwxr-xr-x   4 muratguner  staff    128 Mar  8 21:06 data/
drwxr-xr-x  12 muratguner  staff    384 Mar  8 21:06 img/


In [15]:
import pandas as pd

muj = pd.read_csv('data/made_up_jobs.csv')


pandas can read many different file types through its `read_*` functions, for example `read_excel` and `read_html`.
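The reader functions share the same pattern: pass a path or a file-like object and get a DataFrame back. The sketch below avoids depending on files on disk by reading from in-memory text. The column names and values are invented for illustration and are not the contents of `made_up_jobs.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory instead of in a file.
csv_text = "title,salary\nanalyst,50000\nengineer,65000\n"

# read_csv accepts a path or any file-like object.
jobs = pd.read_csv(io.StringIO(csv_text))
print(jobs)

# The same table written out as JSON and read back with read_json.
json_text = jobs.to_json(orient='records')
jobs_from_json = pd.read_json(io.StringIO(json_text), orient='records')
print(jobs_from_json)
```

Some readers need an extra package to be installed, for example `read_excel` relies on an Excel engine such as `openpyxl`.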

In [16]:
type(pd)

module

In [17]:
## Let's take a look at the attributes of the pandas module
dir(pd)

['BooleanDtype',
 'Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NA',
 'NaT',
 'NamedAgg',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseDtype',
 'StringDtype',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__getattr__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_config',
 '_hashtable',
 '_is_numpy_dev',
 '_lib',
 '_libs',
 '_np_version_under1p14',
 '_np_version_under1p15',
 '_np_version_under1p16',
 '_np_version_under1p17',
 ...

__Some methods and attributes that will be useful__

- head, tail

- describe

- info

- loc vs iloc?

- values

- renaming columns

- dropping columns
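A quick sketch of each item in the list, using the small `states` table from earlier (rebuilt here so that the cell runs on its own). The new column name `land_area` is hypothetical and chosen only to demonstrate renaming.

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127, 19552860, 12882135],
     'area': [423967, 695662, 141297, 170312, 149995]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])

# head / tail: the first or last n rows (n defaults to 5).
top_two = states.head(2)
last_two = states.tail(2)

# describe: summary statistics for the numeric columns.
summary = states.describe()

# info: prints column names, non-null counts, dtypes and memory usage.
states.info()

# loc selects by label, iloc selects by integer position.
by_label = states.loc['Texas', 'area']
by_position = states.iloc[1, 1]  # second row, second column: the same cell

# values: the underlying data as a NumPy array.
raw = states.values

# Renaming and dropping columns return new DataFrames by default;
# `states` itself is left unchanged.
renamed = states.rename(columns={'area': 'land_area'})
without_area = states.drop(columns=['area'])

print(top_two)
print(summary)
print(renamed.columns.tolist(), without_area.columns.tolist())
```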

