# Week 4: An intro to Python, Pandas and Jupyter 

## Environment basics
- Class [GitHub repo](https://github.com/stiles/usc)
- Structure and workflow
- Clone the repo using GitHub Desktop | [documentation](https://docs.github.com/en/desktop/contributing-and-collaborating-using-github-desktop/adding-and-cloning-repositories/cloning-a-repository-from-github-to-github-desktop)
- Set a directory structure locally and stick to it!

### Jupyter Lab interface basics
- Directory pane
- [Documentation](https://jupyterlab.readthedocs.io/en/stable/)
- Typing and executing code in cells
- Cell types
- Running/restarting a notebook
- Markdown/documentation
- Keyboard shortcuts
- Copy/paste cells

### Import Python tools
These are the Python libraries we will use to complete our work. Here we only need to import [Pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started). 

In [1]:
import pandas as pd


### Now what? 

#### Let's start with simple math

In [3]:
777 / 7

111.0

In [4]:
60 + 7

67

In [5]:
3 - 9

-6
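
The first result above is `111.0` rather than `111` because `/` always returns a float in Python 3. A minimal sketch of the related operators, using the same numbers (the variable names are only for illustration):

```python
# True division always returns a float, even when the result is a whole number
true_div = 777 / 7
print(true_div)    # 111.0

# Floor division returns an integer when both operands are integers
floor_div = 777 // 7
print(floor_div)   # 111

# The modulo operator returns the remainder of the division
remainder = 777 % 7
print(remainder)   # 0
```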

#### Defining a variable

In [6]:
number = 100

In [7]:
print(number)

100


#### Conduct a simple data analysis

In [8]:
my_list = [2, 4, 6, 8, 10, 12, 14, 16]

In [9]:
my_list

[2, 4, 6, 8, 10, 12, 14, 16]

In [10]:
my_series = pd.Series(my_list)

In [11]:
my_series

0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
dtype: int64
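
The left-hand column in the output above is the Series index and the right-hand column holds the values. As a rough sketch of what a Series allows that a plain list does not, the example below rebuilds the same Series and tries positional access, element-wise arithmetic and filtering (the variable names are illustrative):

```python
import pandas as pd

my_series = pd.Series([2, 4, 6, 8, 10, 12, 14, 16])

# Positional access with .iloc, similar to indexing a list
first_value = my_series.iloc[0]
last_value = my_series.iloc[-1]

# Arithmetic applies to every element at once
doubled = my_series * 2

# A boolean condition keeps only the matching values
large_values = my_series[my_series > 10]

print(first_value, last_value)
print(doubled.tolist())
print(large_values.tolist())
```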

In [12]:
# Optional: build a much larger Series of random integers (requires `import numpy as np`)
# my_series = pd.Series(np.random.randint(10, 1000, size=100000000))

#### Descriptive statistics

Once the data becomes a Series, you can immediately run a wide range of [descriptive statistics](https://en.wikipedia.org/wiki/Descriptive_statistics). Let’s try a few.

In [13]:
my_series.sum()

72

#### Then find the maximum value in the next cell

In [14]:
my_series.max()

16

#### And the minimum value in the cell after that

In [15]:
my_series.min()

2

#### How about the average (also known as the mean)? Keep adding cells and calculating new statistics.

In [16]:
my_series.mean()

9.0

#### The median?

In [17]:
my_series.median()

9.0

#### The standard deviation?

In [18]:
my_series.std()

4.898979485566356

#### And all of the above, plus a little more about the distribution, in one simple command.

In [19]:
my_series.describe()

count     8.000000
mean      9.000000
std       4.898979
min       2.000000
25%       5.500000
50%       9.000000
75%      12.500000
max      16.000000
dtype: float64
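
Two details of the summary above are worth checking by hand. By default, pandas computes the sample standard deviation (`ddof=1`), and the 25% and 75% rows are quantiles found by linear interpolation. The sketch below reproduces those figures on the same Series; the variable names are illustrative.

```python
import pandas as pd

my_series = pd.Series([2, 4, 6, 8, 10, 12, 14, 16])

# Default: sample standard deviation (divides by n - 1)
sample_std = my_series.std()

# Population standard deviation (divides by n)
population_std = my_series.std(ddof=0)

# The 25% and 75% rows of describe() are quantiles
q1 = my_series.quantile(0.25)
q3 = my_series.quantile(0.75)

# The interquartile range is the spread of the middle half of the data
iqr = q3 - q1

print(sample_std, population_std)
print(q1, q3, iqr)
```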

---

## Import data

#### Read a CSV file with members of Congress

In [20]:
df_csv = pd.read_csv('../../data/raw/members_of_congress_117.csv')

FileNotFoundError: [Errno 2] No such file or directory: '../../data/raw/members_of_congress_117.csv'
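
A `FileNotFoundError` like the one above usually means the relative path does not match the notebook's working directory. Below is a minimal sketch of one way to check before reading. The helper name `read_csv_if_present` is hypothetical (it is not part of pandas), and the path is simply the one used in the cell above.

```python
from pathlib import Path

import pandas as pd


def read_csv_if_present(path):
    """Return a DataFrame if the file exists, otherwise print a hint and return None."""
    csv_path = Path(path)
    if not csv_path.exists():
        # Show where Python is looking so the relative path can be corrected
        print('Working directory:', Path.cwd())
        print('File not found at:', csv_path.resolve())
        return None
    return pd.read_csv(csv_path)


# Same relative path as the cell above; adjust it to your own directory layout
df_csv = read_csv_if_present('../../data/raw/members_of_congress_117.csv')
```

If the helper prints a "File not found" message, compare the resolved path with where the file actually lives in your local copy of the repo.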

#### Or an Excel file

In [None]:
df_excel = pd.read_excel('../../data/raw/members_of_congress_117_excel.xlsx')

#### Import from a URL

In [None]:
df_url = pd.read_csv('https://raw.githubusercontent.com/stiles/notebooks/master/congress/output/members_of_congress_117.csv')

#### Make a copy and assign it to a new variable

In [None]:
df = df_csv.copy()
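
Working on a copy leaves the originally imported dataframe untouched, so you can always return to it without re-reading the file. Here is a small sketch using a made-up two-row dataframe as a stand-in for the real data:

```python
import pandas as pd

# Hypothetical stand-in for the imported data
original = pd.DataFrame({'last_name': ['Smith', 'Jones'], 'party': ['D', 'R']})

# .copy() creates an independent dataframe
working = original.copy()

# Changes to the copy do not affect the original
working['party'] = working['party'].str.lower()

print(original['party'].tolist())
print(working['party'].tolist())
```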

---

## Understanding the dataframe

#### Use the `describe()` method to get summary stats on any numerical columns

In [None]:
df.describe()

#### Use the `info()` method for data types and columns

In [None]:
df.info()

In [None]:
df.columns

In [None]:
df.columns = df.columns.str.upper()

In [None]:
# This changes all of the column names to uppercase; `.str.lower()` would make them lowercase

In [None]:
df
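
The `.str` accessor on `df.columns` supports other string methods too, and they can be chained. The sketch below cleans up a made-up dataframe whose column names contain spaces and inconsistent case; the column names are hypothetical and not taken from the Congress file.

```python
import pandas as pd

# Hypothetical dataframe with messy column names
sample = pd.DataFrame({'Last Name': ['Smith'], 'party ': ['D']})

# Strip stray whitespace, lowercase, then replace inner spaces with underscores
sample.columns = (
    sample.columns
    .str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
)

print(list(sample.columns))
```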

#### Reading one column, or "series"

In [None]:
df['LAST_NAME']

#### Counting values in categorical or string columns

In [None]:
df['PARTY'].value_counts()
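
`value_counts()` returns raw counts sorted from most to least common. Passing `normalize=True` returns each category's share of the total instead. A brief sketch on made-up party labels (not the real Congress data):

```python
import pandas as pd

# Hypothetical party labels
party = pd.Series(['D', 'R', 'D', 'I', 'R', 'D'])

# Raw counts per category
counts = party.value_counts()

# Shares of the total, which sum to 1
shares = party.value_counts(normalize=True)

print(counts)
print(shares)
```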

---

## Interacting with the data

#### Use the `head()` method to see the first *n* rows

In [None]:
df.head()

#### Use the `tail()` method to see the last *n* rows

In [None]:
df.tail()
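
Both methods return five rows by default and accept a different *n* as an argument. A short sketch on a made-up ten-row dataframe:

```python
import pandas as pd

# Hypothetical dataframe with the values 1 through 10
numbers = pd.DataFrame({'value': range(1, 11)})

# First three rows
first_three = numbers.head(3)

# Last three rows
last_three = numbers.tail(3)

print(first_three['value'].tolist())
print(last_three['value'].tolist())
```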

#### Sorting with the `sort_values()` method to find the members with the least and most seniority

In [None]:
df.sort_values('SENIORITY', ascending=True).head()

In [None]:
df.sort_values('SENIORITY', ascending=False).head()

#### Sorting with the `sort_values()` method to find the members who vote with their party least or most often

In [None]:
df.sort_values('VOTES_WITH_PARTY_PCT', ascending=True).head()

In [None]:
df.sort_values('VOTES_WITH_PARTY_PCT', ascending=False).head()
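
`sort_values()` can also sort on several columns at once, and `nlargest()` is a shortcut for "sort descending, then take the top *n*". The sketch below uses a made-up four-row dataframe. The column names mirror the ones above, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical members; the values are invented
members = pd.DataFrame({
    'LAST_NAME': ['A', 'B', 'C', 'D'],
    'PARTY': ['D', 'R', 'D', 'R'],
    'SENIORITY': [10, 4, 22, 16],
})

# Sort by party (A to Z), then by seniority (highest first) within each party
by_party = members.sort_values(['PARTY', 'SENIORITY'], ascending=[True, False])

# The two most senior members overall
top_two = members.nlargest(2, 'SENIORITY')

print(by_party['LAST_NAME'].tolist())
print(top_two['LAST_NAME'].tolist())
```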

---

## What questions would you ask of this dataset? 

In [None]:
df.groupby(['STATE'])['DW_NOMINATE'].mean()

In [None]:
df.groupby(['STATE'])['SENIORITY'].mean().sort_values(ascending=False)

In [None]:
df.groupby(['STATE'])['SENIORITY'].max().sort_values(ascending=False)

In [None]:
df.groupby(['STATE'])['SENIORITY'].median().sort_values(ascending=False)

In [None]:
df.groupby(['STATE'])['SENIORITY'].min().sort_values(ascending=False)
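
The four cells above each compute one statistic per state. The `agg()` method can compute several in a single pass and return them side by side. A minimal sketch on made-up data (the states and seniority values are invented):

```python
import pandas as pd

# Hypothetical seniority values for two states
members = pd.DataFrame({
    'STATE': ['CA', 'CA', 'TX', 'TX', 'TX'],
    'SENIORITY': [10, 30, 4, 8, 12],
})

# One row per state, one column per statistic, sorted by the mean
summary = (
    members.groupby('STATE')['SENIORITY']
    .agg(['mean', 'median', 'min', 'max'])
    .sort_values('mean', ascending=False)
)

print(summary)
```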

---

## Export

In [None]:
df.to_csv('../../data/processed/members_of_congress_117.csv', index=False)

In [None]:
# This exports the dataframe as a CSV file; `index=False` leaves out the row index

In [None]:
df.to_excel('../../data/processed/members_of_congress_117.xlsx')
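
The CSV export above passes `index=False`, while the Excel export does not, so the Excel file will include the row index as an extra first column. The sketch below shows the same effect with CSV files, writing a made-up dataframe to a temporary directory and reading it back. The file names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical dataframe to export
sample = pd.DataFrame({'STATE': ['CA', 'TX'], 'SENIORITY': [20, 8]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Without the index: the file contains only the data columns
    clean_path = Path(tmp_dir) / 'sample.csv'
    sample.to_csv(clean_path, index=False)
    round_trip = pd.read_csv(clean_path)

    # With the index (the default): an extra unnamed column appears on re-import
    indexed_path = Path(tmp_dir) / 'sample_with_index.csv'
    sample.to_csv(indexed_path)
    round_trip_indexed = pd.read_csv(indexed_path)

print(list(round_trip.columns))
print(list(round_trip_indexed.columns))
```

Passing `index=False` to `to_excel()` works the same way if you would rather leave the index out of the spreadsheet.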