## Rectangular Data
Spreadsheet or database table

General term for a 2D matrix with rows indicating records (cases) and columns indicating features (variables)

**DataFrame** is the specific format in Python and R

**Feature:** a column within a table. **Record:** a row within a table.

Features are used to predict a **target** (outcome)
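A minimal sketch of these terms in pandas. The table and its column names (`sqft`, `bedrooms`, `price`) are made up for illustration and are not from any dataset in these notes:

```python
import pandas as pd

# Hypothetical toy table: each row is a record, each column a feature.
houses = pd.DataFrame({
    "sqft": [1400, 2100, 900],
    "bedrooms": [3, 4, 2],
    "price": [250_000, 410_000, 150_000],  # treated here as the target
})

features = houses[["sqft", "bedrooms"]]  # predictor columns
target = houses["price"]                 # outcome column

n_records, n_features = features.shape   # rows, columns
```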

## Nonrectangular Data
#### Time Series Data
#### Spatial Data  - mapping and location analytics
#### Graph (network) Data
Represents physical, social, or abstract relationships, e.g. social networks

**Key Ideas**
- The basic data structure in data science is a rectangular matrix in which rows are records and columns are variables (features). 
- Terminology can be confusing; there are a variety of synonyms arising from the different disciplines that contribute to data science (statistics, computer science, and information technology).

## Estimates of Location
Variables with measured or count data might have thousands of distinct values. A basic step in exploring your data is getting a “typical value” for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).

Key Terms:
- **Trimmed mean** The average of all values after dropping a fixed number of extreme values. Synonym: truncated mean
    - eliminates the influence of extreme values
    - Ex. used in figure skating judging: remove the top and bottom judges' scores
    - widely used, often preferred over the regular mean (a common choice is to trim the top and bottom 10%)

- **Weighted mean** The sum of all values times a weight divided by the sum of the weights. 
    - Two Motivations:
        - Some values are intrinsically more variable than others, and highly variable observations are given a lower weight. For example, if we are taking the average from multiple sensors and one of the sensors is less accurate, then we might downweight the data from that sensor
        - data collected does not equally represent the different groups that we are interested in measuring. For example, because of the way an online experiment was conducted, we may not have a set of data that accurately reflects all groups in the user base. To correct that, we can give a higher weight to the values from the groups that were underrepresented
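- **Weighted median** The value such that one-half of the sum of the weights lies above it and one-half below it in the sorted data.
    - sort the data, keeping each value's weight attached, then take the value at which the cumulative weight reaches half of the total
    - like the ordinary median, robust to outliers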
  
- **Robust** Not sensitive to extreme values. Synonym resistant 
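The definitions above can be sketched in plain Python. This is an illustrative version, not how SciPy or `wquantiles` implement them: `trimmed_mean` here takes a count `p` to drop from each end (SciPy's `trim_mean` takes a proportion instead), and `weighted_median` uses one simple convention with no interpolation, so library results may differ slightly.

```python
def trimmed_mean(values, p):
    """Mean after dropping the p smallest and p largest values."""
    ordered = sorted(values)
    kept = ordered[p:len(ordered) - p]
    return sum(kept) / len(kept)


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """First sorted value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value


scores = [1, 2, 3, 4, 100]                     # one extreme value
plain = sum(scores) / len(scores)              # pulled up by the outlier
tm = trimmed_mean(scores, 1)                   # averages 2, 3, 4
wm = weighted_mean([10, 20], [3, 1])           # (30 + 20) / 4
wmed = weighted_median([1, 2, 3], [1, 1, 5])   # the heavy weight pulls it to 3
```

The plain mean of `scores` is 22, while the trimmed mean is 3, which is the robustness the key terms describe.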

#### Python Calcs
Mean: `state['Population'].mean()`

Trimmed mean: `trim_mean(state['Population'], 0.1)` (needs `from scipy.stats import trim_mean`; 0.1 trims 10% from each end)

Median: `state['Population'].median()`

Weighted mean and median:
- `np.average(state['Murder.Rate'], weights=state['Population'])`
- `wquantiles.median(state['Murder.Rate'], weights=state['Population'])`


## Estimates of Variability 

Standard deviation is easier to interpret than variance since it is on the same scale as the original data

#### Degrees of Freedom - n or n-1
- dividing by n underestimates the true variance and standard deviation of the population (a biased estimate)
- dividing by n-1 instead makes the variance estimate unbiased
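A small sketch of the n versus n-1 choice using NumPy's `ddof` argument, with made-up numbers. Note that NumPy defaults to `ddof=0` (divide by n), while pandas `Series.var()` and `Series.std()` default to `ddof=1` (divide by n-1), so the two libraries can disagree on the same data unless `ddof` is set explicitly.

```python
import numpy as np

# Hypothetical sample; the mean is 5 and the squared deviations sum to 32.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

var_n = np.var(data, ddof=0)    # divide by n      -> 32 / 8
var_n1 = np.var(data, ddof=1)   # divide by n - 1  -> 32 / 7
std_n1 = np.std(data, ddof=1)   # square root of var_n1
```

The n-1 version is always the larger of the two; the gap shrinks as n grows.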

## Estimates Based on Percentiles
- Look at the spread of the sorted data
- For very large datasets, computing exact percentiles is computationally expensive since it involves sorting all the values
- Software uses special algorithms to get an approximate percentile to within a certain accuracy

**Python Calcs**

Quantiles: `state['Murder.Rate'].quantile([0.05,0.25, 0.5, 0.75, 0.95])`
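One common percentile-based measure of spread is the interquartile range (IQR), the difference between the 75th and 25th percentiles. A minimal sketch on a made-up series (not the `state` data), relying on pandas' default linear interpolation:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])  # hypothetical data

q1 = values.quantile(0.25)   # 25th percentile
q3 = values.quantile(0.75)   # 75th percentile
iqr = q3 - q1                # spread of the middle half of the data
```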

#### Histograms:

`pandas.cut` creates a series that maps values into segments (in this case 10 equal-width bins)

`binnedPop = pd.cut(state['Population'], 10)`

`binnedPop.value_counts()`
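The same binning idea on a tiny made-up series, so the counts can be checked by hand. The values and the choice of 3 bins are assumptions for illustration; `sort_index()` is used so the counts come out in bin order rather than by frequency.

```python
import pandas as pd

pop = pd.Series([1, 2, 3, 10, 20, 21, 22, 50, 90, 100])  # hypothetical values

binned = pd.cut(pop, 3)                       # 3 equal-width bins over [1, 100]
counts = binned.value_counts().sort_index()   # frequency table in bin order
```

Here the inner bin edges fall at 34 and 67, so seven values land in the first bin, one in the second, and two in the third.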

Pandas supports histograms for data frames with `DataFrame.plot.hist`

`ax = (state['Population'] / 1_000_000).plot.hist(figsize=(4,4))`

`ax.set_xlabel('Pop (millions)')`

#### Density Plot
Shows distribution of data values as continuous line => smoothed histogram

Pandas provides the `plot.density` method to create a density plot

`ax=state['Murder.Rate'].plot.hist(density=True, xlim=[0,12], bins=range(1,12))`

`state['Murder.Rate'].plot.density(ax=ax)` (plot functions often take an optional axis argument, `ax`, which causes the plot to be added to the same graph)

`ax.set_xlabel('Murder Rate (per 100,000)')`

#### Key Ideas
- a frequency histogram plots frequency counts on the y-axis and variable values on the x-axis
    - gives a sense of the distribution of the data at a glance
- a boxplot gives a quick sense of the distribution
    - often used in side-by-side displays to compare distributions
- a density plot is a smoothed version of a histogram
    - requires a function to estimate the plot from the data (typically a kernel density estimate)
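To make "requires a function to estimate the plot" concrete, here is a rough sketch of a Gaussian kernel density estimate written directly in NumPy. The helper name `gaussian_kde_1d`, the sample values, and the fixed bandwidth of 1.0 are all assumptions for illustration; plotting libraries normally choose the bandwidth automatically.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One row per grid point, one column per sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)


samples = [1.0, 2.0, 2.5, 4.0, 7.0]            # hypothetical data
grid = np.linspace(-5, 15, 2001)               # x values for the curve
density = gaussian_kde_1d(samples, grid, bandwidth=1.0)

# A density curve should enclose an area of about 1 (simple Riemann sum).
area = float(np.sum(density) * (grid[1] - grid[0]))
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve, which is why different estimates of the same data are possible.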
    

## Exploring Binary and Categorical Data (pp. 27)

**Mode:** the most commonly occurring category or value; a simple summary statistic for categorical data, generally not used for numeric data

#### Expected Value
A special type of categorical data is data in which the categories represent or can be mapped to discrete values on the same scale. 

Form of weighted mean

Ex. A marketer for a new cloud technology offers two levels of service, one priced at $300/month and another at $50/month. The marketer offers free webinars to generate leads, and the firm figures that 5% of the attendees will sign up for the $300 service, 15% will sign up for the $50 service, and 80% will not sign up for anything. This data can be summed up, for financial purposes, in a single “expected value,” which is a form of weighted mean, in which the weights are probabilities.

Calculation:
- multiply each outcome by its probability of occurrence
- sum these values
- EV = 0.05 × 300 + 0.15 × 50 + 0.80 × 0 = 22.5 (i.e., $22.50 per month per attendee)
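The webinar calculation above, written as a weighted mean with probabilities as weights. The variable names are illustrative only:

```python
# Monthly revenue for each outcome and the probability of that outcome.
outcomes = [300, 50, 0]
probabilities = [0.05, 0.15, 0.80]

# Probabilities must cover every outcome, so they should sum to 1.
total_probability = sum(probabilities)

# Multiply each outcome by its probability, then sum.
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
```

Because the weights already sum to 1, there is no need to divide by the sum of the weights as in the general weighted mean.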

## Correlation
- Positive correlation: as x increases, y tends to increase
- Negative correlation: as x increases, y tends to decrease

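**Correlation Coefficient:** extent to which numeric variables are associated with one another (ranges from -1 to 1)
- Calculation: multiply the deviations from the mean for variable 1 by those for variable 2, average the products (dividing by n-1), and divide by the product of the standard deviations
- if the association is not linear, the correlation coefficient may not be a useful metric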
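The calculation above can be sketched in NumPy. The function name `pearson_r` and the toy data are assumptions for illustration. The second dataset is a perfect but nonlinear (U-shaped) relationship, which shows how the coefficient can miss an association that is not linear.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: summed products of deviations over (n-1) * s_x * s_y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    n = len(x)
    return np.sum(dx * dy) / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))


x = [1, 2, 3, 4, 5]
y_linear = [2, 4, 6, 8, 10]              # y = 2x, perfectly linear
y_quadratic = [(v - 3) ** 2 for v in x]  # U-shaped: fully determined by x, not linear

r_linear = pearson_r(x, y_linear)        # close to 1
r_quadratic = pearson_r(x, y_quadratic)  # close to 0 despite the exact relationship

# Cross-check against NumPy's built-in correlation matrix.
r_check = np.corrcoef(x, y_linear)[0, 1]
```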

**Correlation Matrix:** table where the variables are shown on both rows and columns, and each cell is the correlation between the two variables
- Check pp. 32 for an example of the Python code
- the correlation coefficient is sensitive to outliers; `sklearn.covariance` offers a variety of alternative approaches
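A minimal correlation-matrix sketch using `DataFrame.corr()`. The columns `A`, `B`, `C` and their values are made up; `C` is deliberately the negative of `A` so that one cell is easy to verify.

```python
import pandas as pd

# Hypothetical daily returns for three series; C is exactly -A.
returns = pd.DataFrame({
    "A": [0.01, -0.02, 0.03, 0.00, -0.01],
    "B": [0.02, -0.01, 0.02, 0.01, -0.02],
    "C": [-0.01, 0.02, -0.03, 0.00, 0.01],
})

corr = returns.corr()          # Pearson correlation by default
a_vs_c = corr.loc["A", "C"]    # close to -1 by construction
```

The diagonal is always 1 (each variable with itself) and the matrix is symmetric, so only the cells above or below the diagonal carry information.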

**Scatterplot:** plot in which the x-axis is the value of one variable and the y-axis is the value of another

`ax = telecom.plot.scatter(x='T', y='VZ', figsize=(4,4), marker='$\u25EF$')`

`ax.set_xlabel('ATT')`

`ax.set_ylabel('VERIZON')`

`ax.axhline(0, color='grey', lw=1)`

`ax.axvline(0, color='grey', lw=1)`





## Exploring Two or More Variables
**Contingency table:** A tally of counts between two or more categorical variables. 

**Hexagonal binning:** A plot of two numeric variables with the records binned into hexagons. 

**Contour plot:** A plot showing the density of two numeric variables like a topographical map. 

**Violin plot:**  Similar to a boxplot but showing the density estimate.
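Of the terms above, the contingency table is the easiest to show without plotting. A small sketch using `pd.crosstab`; the `loans` table and its `grade` and `status` columns are hypothetical:

```python
import pandas as pd

# Hypothetical records with two categorical variables.
loans = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B", "C"],
    "status": ["paid", "late", "paid", "paid", "late", "late"],
})

# Tally of counts for every grade/status combination;
# margins=True adds row and column totals labelled "All".
table = pd.crosstab(loans["grade"], loans["status"], margins=True)
```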