# pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
+ Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R.

+ pandas does not implement significant modeling functionality; for that, look to libraries such as statsmodels and scikit-learn.

### Library Highlights
+ A fast and efficient <b>DataFrame</b> object for data manipulation with integrated indexing.
+ Tools for <b>reading and writing data</b> between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.
+ Intelligent <b>data alignment</b> and integrated handling of <b>missing data</b>.
+ Flexible <b>reshaping</b> and pivoting of data sets.
+ Aggregating or transforming data with a powerful <b>group by</b> engine, allowing split-apply-combine operations on data sets.
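
The split-apply-combine idea from the last highlight can be sketched in a few lines. The `sales` frame and its column names are hypothetical, made up only for this illustration:

```python
import pandas as pd

# Hypothetical toy data: two regions with two sales each.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [10, 20, 30, 40],
})

# Split the rows by region, apply a sum to each group,
# and combine the per-group results into a single Series.
totals = sales.groupby("region")["amount"].sum()
print(totals)
```

The result is indexed by the group key, so `totals["north"]` looks up the total for one region.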

### Data structures
<table><tr><th>Dimensions</th><th>Name</th><th>Description</th></tr>
    <tr><td>1</td><td>Series</td><td>1D labeled homogeneously-typed array</td></tr>
    <tr><td>2</td><td>DataFrame</td><td>General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns</td></tr></table>
    
#### Why more than one data structure?
The best way to think about the pandas data structures is as flexible containers for lower-dimensional data. For example, a DataFrame is a container for Series, and a Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
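
A minimal sketch of the container idea, assuming a small made-up frame named `frame`: indexing a DataFrame gives a Series, indexing a Series gives a scalar, and columns can be added or dropped much like dictionary entries.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Pulling a column out of a DataFrame yields a Series...
col = frame["a"]
# ...and pulling an element out of a Series yields a scalar.
first = col[0]

# Columns can be inserted and removed in a dictionary-like way.
frame["c"] = frame["a"] * 2   # insert a new column
removed = frame.pop("b")      # remove a column and get it back, like dict.pop
del frame["c"]                # remove a column, like del on a dict key

print(list(frame.columns))
```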

# Object creation

In [3]:
import numpy as np
import pandas as pd
# Creating a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [6]:
# Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
dates = pd.date_range('20130101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [9]:
# Creating a DataFrame by passing a dict of objects that can be converted to series-like.
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

In [None]:
# The columns of the resulting DataFrame have different dtypes.
# Inspect it with df2.dtypes, df2.shape, df2.head(), df2.tail(), df2.index and df2.columns.
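
As a rough illustration of how a dict of differently shaped objects becomes one frame, the sketch below builds a smaller variant of `df2` under the hypothetical name `mixed`. Scalars are broadcast to every row, and each column keeps its own dtype.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({
    "A": 1.0,                                            # scalar, broadcast to every row
    "C": pd.Series(1, index=range(4), dtype="float32"),  # sets the row index 0..3
    "D": np.array([3] * 4, dtype="int32"),
    "F": "foo",                                          # scalar string, broadcast too
})

print(mixed.dtypes)  # one dtype per column
print(mixed.shape)
```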

+ DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: <b>NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column.</b> When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
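
The dtype behaviour described above can be checked directly. This is a small sketch with two hypothetical frames, `floats` and `mixed`, rather than the tutorial's `df` and `df2`:

```python
import numpy as np
import pandas as pd

floats = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

# All columns are float64, so one float64 array can hold everything.
print(floats.to_numpy().dtype)

# Floats and strings share no common NumPy dtype except object,
# so every value is boxed as a Python object.
print(mixed.to_numpy().dtype)
```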

In [17]:
# For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data.
#df.to_numpy()

In [18]:
# For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.
# df2.to_numpy()

In [19]:
# describe() shows a quick statistical summary of your data:
# df.describe()
# Transposing your data:
# df.T

In [20]:
# Sorting by an axis:
# df.sort_index(axis=1, ascending=False)

In [21]:
# Sorting by values:
#df.sort_values(by='B')
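
To make the difference between the two sorting calls above concrete, here is a minimal sketch on a tiny hypothetical frame named `demo`. `sort_index(axis=1)` reorders columns by their labels, while `sort_values(by=...)` reorders rows by the data in one column.

```python
import pandas as pd

demo = pd.DataFrame({"A": [9, 8, 7], "B": [3, 1, 2]}, index=["x", "y", "z"])

# Sort the column labels in descending order: B comes before A.
by_columns = demo.sort_index(axis=1, ascending=False)

# Sort the rows by the values stored in column B.
by_values = demo.sort_values(by="B")

print(by_columns)
print(by_values)
```

Both calls return a new DataFrame and leave `demo` unchanged.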

# Viewing data

In [1]:
# df.head()
# df.tail()

# Selection

## Getting
#### Selecting a single column, which yields a Series, equivalent to df.A:

```python
>>> df['A']                         # selection
```

<b>Selecting via [], which slices the rows.</b>
```python
>>> df[0:3]
```

<b>For getting a cross section using a label:</b>
```python
>>> df.loc[dates[0]]
```

<b>Selecting on a multi-axis by label:</b>
```python
>>> df.loc[:, ['A', 'B']]
```

<b>Showing label slicing; both endpoints are included:</b>
```python
>>> df.loc['20130102':'20130104', ['A', 'B']]
```

## Selection by position

<b>Select via the position of the passed integers:</b>
```python
>>> df.iloc[3]
```

<b>By integer slices, acting similarly to NumPy/Python:</b>
```python
>>> df.iloc[3:5, 0:2]
```
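
One detail is worth seeing side by side: label slices with `.loc` include both endpoints, while integer slices with `.iloc` follow the usual Python convention and exclude the end. A minimal sketch, using a small hypothetical frame named `demo`:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20130101", periods=6)
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=["A", "B"])

# Label-based slice: 2013-01-02, 2013-01-03 and 2013-01-04 are all returned.
by_label = demo.loc["20130102":"20130104"]

# Position-based slice: rows 1 and 2 only, row 3 is excluded.
by_position = demo.iloc[1:3]

print(len(by_label), len(by_position))
```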

<b>By lists of integer position locations, similar to the NumPy/Python style:</b>
```python
>>> df.iloc[[1, 2, 4], [0, 2]]
```

<b>For slicing rows explicitly:</b>
```python
>>> df.iloc[1:3, :]
```

<b>For slicing columns explicitly:</b>
```python
>>> df.iloc[:, 1:3]
```

<b>For getting fast access to a scalar (equivalent to df.iloc[1, 1]):</b>
```python
>>> df.iat[1, 1]
```
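
As a quick check that the scalar accessors agree, the sketch below reads the same cell three ways on a hypothetical frame named `demo`: by position with `iloc`, by position with `iat`, and by label with `at`.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20130101", periods=6)
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=["A", "B"])

via_iloc = demo.iloc[1, 1]      # general positional indexer
via_iat = demo.iat[1, 1]        # scalar-only positional accessor
via_at = demo.at[idx[1], "B"]   # scalar-only label accessor

print(via_iloc, via_iat, via_at)
```

`iat` and `at` only accept a single row and a single column, which is what lets them skip the overhead of the more general indexers.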

## Boolean indexing

<b>Using a single column’s values to select data.</b>
```python
>>> df[df.A > 0]
```

<b>Selecting values from a DataFrame where a boolean condition is met.</b>
```python
>>> df[df > 0]
```
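
The two forms of boolean indexing behave differently: a boolean Series filters rows, while a boolean DataFrame keeps the original shape and blanks out the cells that fail the condition. A minimal sketch with a hypothetical frame named `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]})

# Boolean Series: keeps only the rows where column A is positive.
rows = demo[demo["A"] > 0]

# Boolean DataFrame: same shape as demo, non-matching cells become NaN.
cells = demo[demo > 0]

print(rows)
print(cells)
```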

<b>Using the isin() method for filtering:</b>
```python
>>> df2 = df.copy()

>>> df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

>>> df2[df2['E'].isin(['two', 'four'])]
```
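
`isin()` itself returns a boolean Series, which is then used as a row filter like any other mask. A small sketch that shows the intermediate mask, using a hypothetical frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"E": ["one", "two", "three", "four"], "v": [1, 2, 3, 4]})

# True where the value in E is one of the listed labels.
mask = demo["E"].isin(["two", "four"])

# Use the mask to keep only the matching rows.
picked = demo[mask]

print(mask.tolist())
print(picked)
```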

## Setting

<b>Setting a new column automatically aligns the data by the indexes.</b>
```python
>>> s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
>>> s1
    2013-01-02    1
    2013-01-03    2
    2013-01-04    3
    2013-01-05    4
    2013-01-06    5
    2013-01-07    6
    Freq: D, dtype: int64
>>> df['F'] = s1
```
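
The alignment step is easier to see on a frame with known values. In this sketch (the frame `demo` and Series `shifted` are hypothetical), the Series starts one day later than the frame's index, so the first row gets NaN and the Series value for 2013-01-07 is dropped.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20130101", periods=6)
demo = pd.DataFrame({"A": np.zeros(6)}, index=idx)

# This Series covers 2013-01-02 through 2013-01-07.
shifted = pd.Series([1, 2, 3, 4, 5, 6],
                    index=pd.date_range("20130102", periods=6))

# Assignment matches on index labels, not on position.
demo["F"] = shifted

print(demo["F"])
```

Because a missing value has to be stored, the new column is floating point even though `shifted` holds integers.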

<b>Setting values by label:</b>
```python
>>> df.at[dates[0], 'A'] = 0
```

<b>Setting values by position:</b>
```python
>>> df.iat[0, 1] = 0
```

<b>Setting by assigning with a NumPy array:</b>
```python
>>> df.loc[:, 'D'] = np.array([5] * len(df))
```

<b>The result of the prior setting operations.</b>
```python
>>> df
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02 -1.212112 -0.173215 -0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04 -0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972 -0.567020 -0.276232  5  4.0
2013-01-06 -0.673690 -0.113648 -1.478427  5  5.0
```


In [4]:
# Run the examples above here.
# The values are random, so your results will differ. Focus on understanding the operations rather than matching the numbers.
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

                   A         B         C         D
2013-01-01  0.206656  0.006286 -1.674489 -1.117912
2013-01-02  1.048613 -0.508730 -0.581579  1.562867
2013-01-03  1.647931  1.329333 -0.137115  1.526373
2013-01-04  0.175140 -1.095541  0.336882  0.321555
2013-01-05 -0.544268 -0.063995  0.623904 -1.900260
2013-01-06  0.539089  1.431677 -1.229075 -0.978669


In [None]:
# Missing Value Treatment
