# Pandas Data Ingestion & Inspection

### Review of pandas DataFrames

Get data in and review it
- pandas is a library for data analysis
- DataFrame: a tabular data structure with labelled rows and columns
- Time series: data with a date/time index
- df.iloc[:5, :] - first 5 rows
- df.iloc[-5:, :] - last 5 rows, using a negative index
- df.head(10) - first 10 rows
- df.tail() - last 5 rows
- df.info() - summary information: type of index, columns and dtypes
- Broadcasting - assigning a scalar to a slice sets every element of that slice
- Extract a single column: df['column'] returns a Series
- df['column'].values - on a single column, returns a NumPy array of values

- pandas Series - a 1D labelled NumPy array
- DataFrame - a 2D labelled array whose columns are Series
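
The inspection calls listed above can be tried on a small frame. This is a minimal sketch: the data, the column names `open` and `close`, and the date range are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only to illustrate the inspection calls
df = pd.DataFrame(
    {"open": np.arange(12.0), "close": np.arange(12.0) + 0.5},
    index=pd.date_range("2020-01-01", periods=12, freq="D"),
)

first_five = df.iloc[:5, :]          # same rows as df.head()
last_five = df.iloc[-5:, :]          # same rows as df.tail()
all_but_last_five = df.iloc[:-5, :]  # everything except the last five rows

df.info()                            # prints index type, columns and dtypes

close = df["close"]                  # a single column is a Series
close_values = close.values          # the underlying NumPy array
```

Note the difference between `-5:` (start five rows from the end) and `:-5` (stop five rows before the end).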




In [1]:
import pandas as pd
import numpy as np



In [None]:
# Assigning a scalar to a slice broadcasts the value to every selected element.
# Here: in every third row (rows 0, 3, 6, ...), set the last column to NaN.
# AAPL is assumed to be a DataFrame that was loaded earlier.

AAPL.iloc[::3, -1] = np.nan

In [None]:
low = AAPL['low']

# type returns Series
type(low)

low.head()

# extract numerical entries and form a Numpy Array
lows = low.values

# type returns Numpy Array
type(lows)
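
Since `AAPL` is not created in these notes, the same steps can be sketched on a stand-in frame. The name `prices` and all values below are made up. Only the `low` column name comes from the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the AAPL frame; the values are invented
prices = pd.DataFrame({
    "low": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0],
    "volume": [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0],
})

# Rows 0, 3 and 6 of the last column all receive the same scalar
prices.iloc[::3, -1] = np.nan

low = prices["low"]    # a single column is a Series
lows = low.values      # its entries as a NumPy array

# Count how many entries of the last column were overwritten
n_missing = int(prices["volume"].isna().sum())
```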

#### NumPy and pandas working together

pandas depends upon and interoperates with NumPy, the Python library for fast numeric array computations. For example, you can use the DataFrame attribute .values to represent a DataFrame df as a NumPy array. You can also pass pandas data structures to NumPy functions. In this exercise, pandas is imported as pd and the DataFrame df holds world population data for every 10 years since 1960.

Your job is to extract the values and store them in an array using the attribute .values. You'll then pass that array to the NumPy function np.log10() to compute the base 10 logarithm of the values. Finally, you will pass the entire pandas DataFrame to the same np.log10() function and compare the results.

In [9]:
# Creating df from scratch

data_dict = {'Year':[1960, 1970, 1980, 1990, 2000, 2010],
           'Total Population':[3.034971, 3.684823, 4.436590, 5.282716, 6.115974, 6.924283]}

df = pd.DataFrame(data_dict)

In [10]:
# Import numpy
import numpy as np

# Create array of DataFrame values: np_vals
np_vals = df.values

# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)

# Create array of new DataFrame by passing df to np.log10(): df_log10
df_log10 = np.log10(df)

# Print the type of each original and new data container
for name, obj in [('np_vals', np_vals), ('np_vals_log10', np_vals_log10),
                  ('df', df), ('df_log10', df_log10)]:
    print(name, 'has type', type(obj))

np_vals has type <class 'numpy.ndarray'>
np_vals_log10 has type <class 'numpy.ndarray'>
df has type <class 'pandas.core.frame.DataFrame'>
df_log10 has type <class 'pandas.core.frame.DataFrame'>

As a data scientist, you'll frequently work with NumPy arrays, pandas Series, and pandas DataFrames, and you'll use a variety of NumPy functions and pandas methods on them. Understanding how NumPy and pandas work together will prove to be very useful.
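
The exercise asks for a comparison of the two results, while the cell above only prints their types. One possible way to compare them is sketched below. The names `same_numbers` and `labels_kept` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Year": [1960, 1970, 1980, 1990, 2000, 2010],
    "Total Population": [3.034971, 3.684823, 4.436590,
                         5.282716, 6.115974, 6.924283],
})

np_vals_log10 = np.log10(df.values)   # ndarray in, ndarray out
df_log10 = np.log10(df)               # DataFrame in, DataFrame out

# The numbers agree; only the container differs
same_numbers = np.allclose(np_vals_log10, df_log10.values)

# The DataFrame result keeps the original column labels
labels_kept = list(df_log10.columns) == list(df.columns)
```

Passing the DataFrame directly keeps the index and column labels, which the plain array does not carry.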

# Building DataFrames from Scratch

- pd.read_csv('file', index_col=0) - read a CSV file, using the first column as the index
- create a DataFrame from dictionaries
- create a DataFrame from lists
- Broadcasting - a technique shared by NumPy and pandas
    - use it to create new columns from a single value
    - saves time when generating long lists, arrays, or columns, with no loops needed
    - works with numbers and strings
- rename the index - df.index = ['index names']
- rename the columns - df.columns = ['col names']
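
The `pd.read_csv` item is not demonstrated in the cells that follow, so here is a minimal sketch of what `index_col=0` does. The CSV text is made up, and `io.StringIO` stands in for a file on disk so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """city,signups,visitors
Austin,7,139
Dallas,12,237
"""

# index_col=0 uses the first column as the row index
users = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without index_col, pandas adds a default integer index instead
users_default = pd.read_csv(io.StringIO(csv_text))
```

With `index_col=0` the `city` column becomes the index and is no longer listed among the columns.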

In [11]:

data_dict = {'Year':[1960, 1970, 1980, 1990, 2000, 2010],
           'Total Population':[3.034971, 3.684823, 4.436590, 5.282716, 6.115974, 6.924283]}

df = pd.DataFrame(data_dict)

In [13]:
# Building Dataframes from lists

cities = ['Austin', 'Dallas', 'Austin', 'Dallas']
signups = [7, 12, 3, 5]
visitors = [139, 237, 326, 456]
weekdays = ['Sun', 'Sun', 'Mon', 'Mon']

list_labels = ['city', 'signups', 'visitors', 'weekday']
list_cols = [cities, signups, visitors, weekdays]

In [15]:
# notice list_cols is a list of lists

list_cols

[['Austin', 'Dallas', 'Austin', 'Dallas'],
 [7, 12, 3, 5],
 [139, 237, 326, 456],
 ['Sun', 'Sun', 'Mon', 'Mon']]

In [16]:
# zip() pairs each label with its column;
# list() turns the result into a list of (label, column) tuples

zipped = list(zip(list_labels, list_cols))

In [17]:
print(zipped)

[('city', ['Austin', 'Dallas', 'Austin', 'Dallas']), ('signups', [7, 12, 3, 5]), ('visitors', [139, 237, 326, 456]), ('weekday', ['Sun', 'Sun', 'Mon', 'Mon'])]


In [18]:
data = dict(zipped)

users = pd.DataFrame(data)

In [21]:
users

|   | city | signups | visitors | weekday |
|---|------|---------|----------|---------|
| 0 | Austin | 7 | 139 | Sun |
| 1 | Dallas | 12 | 237 | Sun |
| 2 | Austin | 3 | 326 | Mon |
| 3 | Dallas | 5 | 456 | Mon |


In [24]:
#### Broadcasting - new cols built on the fly

users['fees'] = 0

In [25]:
users

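|   | city | signups | visitors | weekday | fees |
|---|------|---------|----------|---------|------|
| 0 | Austin | 7 | 139 | Sun | 0 |
| 1 | Dallas | 12 | 237 | Sun | 0 |
| 2 | Austin | 3 | 326 | Mon | 0 |
| 3 | Dallas | 5 | 456 | Mon | 0 |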
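
Broadcasting works the same way for strings as for numbers. In this small sketch, the `source` column and its value `'web'` are hypothetical and not part of the original data.

```python
import pandas as pd

users = pd.DataFrame({
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "signups": [7, 12, 3, 5],
    "visitors": [139, 237, 326, 456],
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
})

users["fees"] = 0          # a number is repeated for every row
users["source"] = "web"    # so is a string (hypothetical column)

# For comparison: the same column values built by hand
manual = ["web"] * len(users)
```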


In [30]:
df.columns = ['new1', 'new2']

df

|   | new1 | new2 |
|---|------|------|
| 0 | 3.034971 | 1960 |
| 1 | 3.684823 | 1970 |
| 2 | 4.43659 | 1980 |
| 3 | 5.282716 | 1990 |
| 4 | 6.115974 | 2000 |
| 5 | 6.924283 | 2010 |


In [33]:
df.index = ['a', 'b', 'c', 'd', 'e', 'f']

df

|   | new1 | new2 |
|---|------|------|
| a | 3.034971 | 1960 |
| b | 3.684823 | 1970 |
| c | 4.43659 | 1980 |
| d | 5.282716 | 1990 |
| e | 6.115974 | 2000 |
| f | 6.924283 | 2010 |
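
Assigning to `df.columns` or `df.index` relabels by position, so the new labels follow whatever order the frame currently has, and the list must have exactly one entry per column or row. A minimal sketch, using a shortened copy of the population data and the made-up labels `year` and `population`:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1960, 1970],
    "Total Population": [3.034971, 3.684823],
})

# Positional: the first label goes to the first column, and so on
df.columns = ["year", "population"]
df.index = ["a", "b"]

# A list of the wrong length is rejected with a ValueError
try:
    df.columns = ["only_one"]
    length_error = None
except ValueError as exc:
    length_error = exc
```

Checking `df.columns` before renaming is a cheap way to confirm which column each new label will land on.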
