## Libraries and Pandas



__Our goals today are to be able to:__

- Identify and import Python modules and packages (libraries)
- Investigate table data in Pandas
- Manipulate Pandas DataFrames and Series

## Libraries (Packages)

### Terminology

![mod2](img/modules2.png)



### Terminology

![packages3](img/packages3.png)

### pip & the Python Package Index

[Python Package Index](https://pypi.org/)

<img src="img/pypi_packages.png" width=600>

__You can also write your own modules__

![pipmod](img/import_modules.png)

![pippack](img/package_redo.png)
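
As a minimal sketch of writing your own module, the snippet below saves a tiny module to a temporary folder and imports it. The module name `greetings` and its function `say_hello` are made up for illustration; in practice you would probably save the `.py` file next to your notebook rather than generate it from a string.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module contents; any .py file can be imported as a module.
module_source = '''
def say_hello(name):
    """Return a short greeting."""
    return f"Hello, {name}!"
'''

# Write greetings.py into a temporary folder (a stand-in for your project folder).
module_dir = Path(tempfile.mkdtemp())
(module_dir / "greetings.py").write_text(module_source)

# Python searches the folders listed in sys.path when it resolves an import.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

import greetings                   # import the whole module
from greetings import say_hello    # import a single name from it

message = greetings.say_hello("pandas")
print(message)
```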

## Pandas

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width=600>  

__Why not spreadsheets?__

[5 and Half Reasons to Ditch the Spreadsheet](https://lucidmanager.org/spreadsheets-for-data-science/)

### Installing and Using Pandas

In [5]:
import pandas 

In [6]:
pandas.__version__

'0.25.1'

In [7]:
## `pd` is not defined yet because pandas was imported without an alias
pd?

Object `pd` not found.


In [8]:
## convention

import pandas as pd 

### Main Data Structures in Pandas: Series, DataFrame and Index

#### Series

In [10]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
type(data)

pandas.core.series.Series

In [11]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [16]:
type(data.index)

pandas.core.indexes.base.Index

In [8]:
data[3]

1.0

In [18]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'a', 'c', 'd'])
data

a    0.25
a    0.50
c    0.75
d    1.00
dtype: float64

In [19]:
data['a']

a    0.25
a    0.50
dtype: float64
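
Because the label `'a'` appears twice, `data['a']` returns a Series rather than a single number. The sketch below shows one way to check for repeated labels and how label-based access (`.loc`) differs from position-based access (`.iloc`); the variable names are illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'a', 'c', 'd'])

# Index labels do not have to be unique.
print(data.index.is_unique)

# A repeated label selects every matching row, so the result is a Series.
both_a = data.loc['a']
print(type(both_a).__name__, len(both_a))

# A unique label returns a single value.
only_c = data.loc['c']

# .iloc always selects by position, regardless of the labels.
first = data.iloc[0]
print(only_c, first)
```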

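In [22]:
## `population` is created in the next cell (In [23])
## rename returns a renamed copy; `population` itself is unchanged
population.rename('')
population.reset_index()

|   | index | 0 |
|---|---|---|
| 0 | California | 38332521 |
| 1 | Texas | 26448193 |
| 2 | New York | 19651127 |
| 3 | Florida | 19552860 |
| 4 | Illinois | 12882135 |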


In [23]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
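
A Series built from a dictionary keeps the keys as its index, so it supports dictionary-style lookups as well as array-style operations. A brief sketch, reusing the same figures:

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-style lookup by key
texas = population['Texas']

# Membership tests check the index, like the keys of a dict
has_florida = 'Florida' in population

# Unlike a dict, a Series can be sliced by label (both endpoints are included)
top_three = population['California':'New York']

# Array-style operations apply to every value at once
in_millions = population / 1_000_000

print(top_three)
print(in_millions.round(1))
```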

[For more on Series](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.01-Introducing-Pandas-Objects.ipynb)

#### DataFrame

The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

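In [28]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

## Combine the two Series into a DataFrame; rows are aligned on the shared state index
states = pd.DataFrame({'population': population,
                       'area': area})

## .loc selects by label or boolean mask, so a Series of ratios cannot be passed to it.
## Store the population density (people per unit of area) as a new column instead.
newstates = states.assign(density=states['population'] / states['area'])
newstates.head()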

[Difference between Dataframe and Series](https://stackoverflow.com/questions/26047209/what-is-the-difference-between-a-pandas-series-and-a-single-column-dataframe)
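
One practical way to see the difference is to pull a single column out of a DataFrame in two ways. This is a small sketch on a two-state subset of the figures above; `area_series` and `area_frame` are illustrative names.

```python
import pandas as pd

states = pd.DataFrame({
    'population': {'California': 38332521, 'Texas': 26448193},
    'area': {'California': 423967, 'Texas': 695662},
})

# Single brackets return a one-dimensional Series.
area_series = states['area']

# Double brackets (a list of column names) return a two-dimensional DataFrame.
area_frame = states[['area']]

print(type(area_series).__name__, area_series.shape)
print(type(area_frame).__name__, area_frame.shape)
```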

### Importing and Reading Data with Pandas

In [38]:
## Let's check the current directory first
%pwd

'C:\\Users\\shawj\\fis\\dc-ds-060120\\mod-1\\day-3\\second_session'

In [39]:
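## Let's see the files in the current directory
## (on Windows, %ls runs `dir`, which does not accept Unix flags such as -la; plain %ls works there)
%ls -la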

 Volume in drive C is Windows
 Volume Serial Number is 1E4F-10C4

 Directory of C:\Users\shawj\fis\dc-ds-060120\mod-1\day-3\second_session



File Not Found


In [43]:
import pandas as pd

muj_df = pd.read_csv('data/made_up_jobs.csv')
muj_df.head(30)

|   | ID | Name | Job | Years Employed |
|---|---|---|---|---|
| 0 | 0 | Bob Bobberty | Underwater Basket Weaver | 13 |
| 1 | 1 | Susan Smells | Salad Spinner | 5 |
| 2 | 2 | Alex Lastname | Productivity Manager | 2 |
| 3 | 3 | Rudy P. | Being cool | 55 |
| 4 | 4 | Rudy G. | Being compared to Rudy P | 50 |
| 5 | 5 | Sir Wellington | Cheese Stacker | 10 |


CSV caveats:
- Free-text fields can cause problems in a CSV file.
- There is no standardized formatting.
- Commas that occur naturally in the data can be mistaken for delimiters.

We can read many different file types with pandas, for example with read_excel or read_html.
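
To illustrate the comma problem, here is a small sketch that parses CSV text held in a string. The rows are made up for this example (the second job title is not from the real file); the point is that a field containing a comma has to be wrapped in quotes for `read_csv` to keep it in one column.

```python
import io
import pandas as pd

# Made-up CSV text: the second job title contains a comma, so it is quoted.
csv_text = '''ID,Name,Job
0,Bob Bobberty,Underwater Basket Weaver
1,Susan Smells,"Salad Spinner, Senior"
'''

# io.StringIO lets read_csv treat a string like a file.
jobs = pd.read_csv(io.StringIO(csv_text))
print(jobs)

# The quoted comma stays inside a single field.
print(jobs.loc[1, 'Job'])
```

Without the quotes, that row would have four fields against a three-column header, and the parse would no longer match the intended layout.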

In [41]:
muj_df.shape

(6, 4)

In [31]:
## Let's take a look at the attributes of the pandas module
dir(pd)

['Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NaT',
 'NamedAgg',
 'Panel',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseArray',
 'SparseDataFrame',
 'SparseDtype',
 'SparseSeries',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_config',
 '_hashtable',
 '_lib',
 '_libs',
 '_np_version_under1p14',
 '_np_version_under1p15',
 '_np_version_under1p16',
 '_np_version_under1p17',
 '_tslib',
 ...

__Some methods that will be useful__

- head, tail

- describe

- info

- loc vs iloc?

- values

- renaming columns

- dropping columns
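
The methods listed above can be tried out in one short sketch. The DataFrame below retypes the rows of `made_up_jobs.csv` by hand so the example runs without the file; the new column name `years_employed` is only an illustration.

```python
import pandas as pd

# Same rows as made_up_jobs.csv above, typed in by hand.
muj_df = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4, 5],
    'Name': ['Bob Bobberty', 'Susan Smells', 'Alex Lastname',
             'Rudy P.', 'Rudy G.', 'Sir Wellington'],
    'Job': ['Underwater Basket Weaver', 'Salad Spinner', 'Productivity Manager',
            'Being cool', 'Being compared to Rudy P', 'Cheese Stacker'],
    'Years Employed': [13, 5, 2, 55, 50, 10],
})

print(muj_df.head(2))        # first two rows
print(muj_df.tail(2))        # last two rows
print(muj_df.describe())     # summary statistics for the numeric columns
muj_df.info()                # column dtypes and non-null counts (prints directly)

print(muj_df.loc[3, 'Name'])   # by label: row label 3, column 'Name'
print(muj_df.iloc[3, 1])       # by position: fourth row, second column
print(muj_df.values[:2])       # underlying NumPy array, first two rows

# rename and drop return new DataFrames; muj_df itself is unchanged.
renamed = muj_df.rename(columns={'Years Employed': 'years_employed'})
trimmed = renamed.drop(columns=['ID'])
print(trimmed.columns.to_list())
```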



In [44]:
muj_df.columns.to_list()

['ID', 'Name', 'Job', 'Years Employed']

In [None]:
states

In [48]:
## apply works column by column by default (axis=0), so each `row` here is really a column
states.apply(lambda row: row.population)

AttributeError: ("'Series' object has no attribute 'population'", 'occurred at index population')
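
The AttributeError above arises because `apply` passes each column to the function unless told otherwise, and a column has no `population` entry. A minimal sketch of the row-wise version, using a three-state subset of the figures above:

```python
import pandas as pd

states = pd.DataFrame({
    'population': {'California': 38332521, 'Texas': 26448193, 'New York': 19651127},
    'area': {'California': 423967, 'Texas': 695662, 'New York': 141297},
})

# axis=1 hands each row to the function, so row.population is defined.
pop_by_row = states.apply(lambda row: row.population, axis=1)

# For simple column arithmetic, vectorized operations avoid apply altogether.
density = states['population'] / states['area']

print(pop_by_row)
print(density.round(1))
```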

In [50]:
df = pd.DataFrame({'month': ['a','b','c','d'],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

In [59]:
df=df.set_index(['month'])

In [60]:
## After set_index, the row labels are the month letters, so the integer 1 is not a valid label
df.loc[1]

TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>

In [61]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object', name='month')
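
Since the index now holds the labels `'a'` to `'d'`, `.loc` needs one of those labels, while `.iloc` still accepts an integer position. A short sketch of both, on the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'month': ['a', 'b', 'c', 'd'],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})
df = df.set_index('month')

by_label = df.loc['b']       # row whose index label is 'b'
by_position = df.iloc[1]     # second row, whatever its label

print(by_label)
print(by_label.equals(by_position))

# reset_index moves 'month' back into a column and restores the default
# integer labels, after which .loc[1] is valid again.
restored = df.reset_index()
print(restored.loc[1, 'month'])
```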