## Programming with Python for Data Science

Data from various types of files can be read into a dataframe using the methods below:
    
    import pandas as pd
    from sqlalchemy import create_engine
    engine = create_engine('sqlite:///:memory:')
    sql_dataframe  = pd.read_sql_table('my_table', engine, columns=['ColA', 'ColB'])
    
    xls_dataframe  = pd.read_excel('my_dataset.xlsx', 'Sheet1', na_values=['NA', '?'])
    
    json_dataframe = pd.read_json('my_dataset.json', orient='columns')
    
    csv_dataframe  = pd.read_csv('my_dataset.csv', sep=',')
    
    table_dataframe= pd.read_html('http://page.com/with/table.html')[0]
    
    Note that .read_html() returns a Python list of dataframes, one per HTML table found on the webpage, which is why the example takes element [0].

sep : str, default ‘,’

    Delimiter to use. If sep is None, will try to automatically determine this. Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions, will force use of the python parsing engine and will ignore quotes in the data. Regex example: '\r\t'

delimiter : str, default None

    Alternative argument name for sep.

header : int or list of ints, default ‘infer’

    Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

names : array-like, default None

    List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed unless mangle_dupe_cols=True, which is the default.

index_col : int or sequence or False, default None

    Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to _not_ use the first column as the index (row names).

skipinitialspace : boolean, default False
    
    Skip spaces after delimiter.

skiprows : list-like or integer, default None
    
    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

na_values : scalar, str, list-like, or dict, default None
    
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, among others.

thousands : str, default None
    
    Thousands separator

decimal : str, default ‘.’
    
    Character to recognize as decimal point (e.g. use ‘,’ for European data).
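
As a minimal sketch of how several of these parameters combine, the example below parses a small semicolon-separated text held in memory. The file contents, the column names and the '?' missing-value marker are made up for illustration; only the read_csv parameters come from the list above.

```python
import io
import pandas as pd

# Hypothetical file contents: one junk line, ';' as delimiter,
# '.' as thousands separator, ',' as decimal mark, '?' for missing values.
raw = io.StringIO(
    "exported by some tool\n"
    "city;population;area\n"
    "Berlin;3.645.000;891,7\n"
    "Hamburg;1.841.000;?\n"
)

df = pd.read_csv(
    raw,
    sep=';',
    skiprows=1,        # skip the junk line at the top
    header=0,          # the first remaining line holds the column names
    na_values=['?'],   # treat '?' as NaN in addition to the defaults
    thousands='.',
    decimal=',',
)

print(df)
print(df.dtypes)
```

Passing a file path instead of the StringIO object works the same way.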

A dataframe can be written back out to any of these formats with the matching .to_*() method:

    my_dataframe.to_sql('table', engine)
    my_dataframe.to_excel('dataset.xlsx')
    my_dataframe.to_json('dataset.json')
    my_dataframe.to_csv('dataset.csv')
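
A quick way to check a writer is to round-trip a small dataframe. The sketch below writes to an in-memory buffer rather than a file on disk; the column names and values are placeholders.

```python
import io
import pandas as pd

my_dataframe = pd.DataFrame({'ColA': [1, 2, 3], 'ColB': ['x', 'y', 'z']})

buffer = io.StringIO()
# index=False leaves the row labels out of the file; without it,
# reading the file back produces an extra unnamed column.
my_dataframe.to_csv(buffer, index=False)

buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```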

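The .loc[] indexer selects by label (for example, a column name), .iloc[] selects by integer position, and .ix[] was a hybrid that accepted either. Note that .ix[] was deprecated in pandas 0.20 and removed in pandas 1.0, so prefer .loc[] and .iloc[] in new code.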

#### Produces a series object:
    df.recency
    df['recency']
    df.loc[:, 'recency']
    df.iloc[:, 0]
    df.ix[:, 0]  # only works on pandas versions older than 1.0

#### Produces a dataframe object:
    df[['recency']]
    df.loc[:, ['recency']]
    df.iloc[:, [0]]

The difference between the two is the inner pair of square brackets: passing a list of labels (e.g. ['recency']) returns a DataFrame, while passing a single label returns a Series.
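
The return types can be checked directly. This is a small sketch with a made-up dataframe; only the 'recency' column name is taken from the examples above.

```python
import pandas as pd

df = pd.DataFrame({'recency': [10, 6, 7], 'frequency': [1, 2, 3]})

as_series = df.loc[:, 'recency']     # single label -> Series
as_frame = df.loc[:, ['recency']]    # list of labels -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```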


In pandas, the Python keywords 'or' and 'and' cannot be used to combine boolean Series, because the truth value of a whole Series is ambiguous. Use the element-wise bitwise operators | and & instead.
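
A short sketch of row filtering with these operators, using a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({'recency': [10, 6, 7, 2], 'frequency': [1, 2, 3, 4]})

# Each comparison needs its own parentheses because & and | bind
# more tightly than the comparison operators.
both = df[(df.recency > 5) & (df.frequency > 1)]
either = df[(df.recency < 5) | (df.frequency == 1)]

print(both)
print(either)

# The keyword form raises instead of filtering
try:
    df[(df.recency > 5) and (df.frequency > 1)]
except ValueError as err:
    print("'and' fails:", err)
```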

#### Textual Categorical-Features

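In [41]:

    import pandas as pd

    ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
    df = pd.DataFrame({'satisfaction': ['Mad', 'Happy', 'Unhappy', 'Neutral']})
    # pd.Categorical takes categories/ordered directly; recent pandas versions no
    # longer accept them as keyword arguments to astype("category", ...).
    # 'Mad' is not one of the listed categories, so its code is -1.
    df.satisfaction = pd.Categorical(df.satisfaction,
                                     categories=ordered_satisfaction,
                                     ordered=True).codes
    df

|   | satisfaction |
|---|---|
| 0 | -1 |
| 1 | 3 |
| 2 | 1 |
| 3 | 2 |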


If a column holds ordinal data and you want to encode its values in a meaningful order, change the column's data type to category.

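The optional arguments are ordered and categories:
    
    If ordered is True, the categories are treated as having a meaningful order (the order given in categories), which enables comparisons and sorting by that order.
    categories takes the sequence of category values in the desired order; values not in the sequence are treated as missing and get the code -1.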
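
To illustrate what the ordering buys you, here is a small sketch that reuses the satisfaction levels from the cell above. The variable names levels, answers and ordered are introduced here for illustration only.

```python
import pandas as pd

levels = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
answers = pd.Series(['Happy', 'Unhappy', 'Neutral'])

ordered = pd.Series(pd.Categorical(answers, categories=levels, ordered=True))

codes = ordered.cat.codes.tolist()            # positions within `levels`
lowest, highest = ordered.min(), ordered.max()  # order-aware reductions
above_unhappy = (ordered > 'Unhappy').tolist()  # comparison follows `levels`

print(codes)
print(lowest, highest)
print(above_unhappy)
```

With ordered=False the codes would be the same, but the comparison and the min/max calls would raise, since an unordered categorical has no notion of "greater than".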

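In [42]:

    import pandas as pd

    df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish',
                                       'Amphibian', 'Reptile', 'Mammal']})

    # With no explicit category list, codes follow the alphabetical order of the values
    df.vertebrates = df.vertebrates.astype("category").cat.codes
    df

|   | vertebrates |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 3 |
| 3 | 2 |
| 4 | 0 |
| 5 | 4 |
| 6 | 3 |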


In [43]:

    import pandas as pd

    df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish',
                                       'Amphibian', 'Reptile', 'Mammal']})

    df['new_vertebrates'] = df.vertebrates.astype("category").cat.codes
    df

|   | vertebrates | new_vertebrates |
|---|---|---|
| 0 | Bird | 1 |
| 1 | Bird | 1 |
| 2 | Mammal | 3 |
| 3 | Fish | 2 |
| 4 | Amphibian | 0 |
| 5 | Reptile | 4 |
| 6 | Mammal | 3 |


In [45]:

    df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish',
                                       'Amphibian', 'Reptile', 'Mammal']})
    df = pd.get_dummies(df, columns=['vertebrates'])
    df

|   | vertebrates_Amphibian | vertebrates_Bird | vertebrates_Fish | vertebrates_Mammal | vertebrates_Reptile |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 6 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


These newly created features are called boolean features because the only values they can contain are 0 for non-inclusion or 1 for inclusion. The pandas get_dummies() function lets you completely replace a single nominal feature with multiple boolean indicator features.

In [46]:

    import sys
    sys.version

    '2.7.12 |Anaconda 4.2.0 (32-bit)| (default, Jun 29 2016, 11:42:13) [MSC v.1500 32 bit (Intel)]'

#### Pure Textual Features

In [53]:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "Authman ran faster than Harry because he is an athlete.",
        "Authman and Harry ran faster and faster."]

    bow = CountVectorizer()
    X = bow.fit_transform(corpus)  # sparse matrix of word counts

    # On scikit-learn 1.0 and later, use bow.get_feature_names_out() instead
    words = bow.get_feature_names()
    # ['an', 'and', 'athlete', 'authman', 'because', 'faster', 'harry', 'he', 'is', 'ran', 'than']

    print(X.toarray())
    print(words)

    [[1 0 1 1 1 1 1 1 1 1 1]
     [0 2 0 1 0 2 1 0 0 1 0]]
    [u'an', u'and', u'athlete', u'authman', u'because', u'faster', u'harry', u'he', u'is', u'ran', u'than']


In the above example:

    corpus is the list of input sentences (documents)
    bow is the bag-of-words vectorizer
    words holds the feature names, i.e. the column names of the matrix
    X is the word-count matrix, stored as a sparse matrix to save memory; with a realistic vocabulary most counts are zero, so a dense table with one column per word would be huge
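
To make the counting step concrete, here is a plain-Python sketch of the same idea. It is not scikit-learn's implementation; the tokenize helper is hypothetical and only approximates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters), and the result is a dense list of lists rather than a sparse matrix.

```python
import re
from collections import Counter

corpus = [
    "Authman ran faster than Harry because he is an athlete.",
    "Authman and Harry ran faster and faster.",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of word frequencies per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# Sorted union of all words seen: these are the column names
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

For this corpus the rows match the array printed by the CountVectorizer cell above.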

#### Graphical Features

    from scipy import misc

    # Load the image. Note: scipy.misc.imread was removed in SciPy 1.2;
    # imageio.imread is the suggested replacement on newer installations.
    img = misc.imread('image.png')

    # Is the image too big? Downsample by keeping every second pixel along each axis
    img = img[::2, ::2]

    # Scale colors from (0-255) to (0-1), then reshape to a 1D array with one
    # value per pixel (e.g. grayscale). If you had color images and wanted to
    # preserve all color channels, use .reshape(-1, 3) instead.
    X = (img / 255.0).reshape(-1)
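
The slicing and reshaping steps can be tried without an image file. The sketch below uses small synthetic NumPy arrays as stand-ins for a loaded image; the array sizes and the names small, rgb and pixels are arbitrary choices for illustration.

```python
import numpy as np

# Synthetic 4x6 grayscale "image" with values in the 0..255 range
img = np.arange(24, dtype=np.uint8).reshape(4, 6) * 10

small = img[::2, ::2]              # every second row and column -> 2x3
X = (small / 255.0).reshape(-1)    # scale to 0..1, flatten to one value per pixel
print(small.shape, X.shape)

# Synthetic 4x6 color image with 3 channels per pixel
rgb = np.zeros((4, 6, 3), dtype=np.uint8)
pixels = (rgb[::2, ::2] / 255.0).reshape(-1, 3)   # one row per pixel, one column per channel
print(pixels.shape)
```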

#### Wrangling Your Data