# Introduction

This notebook is an introduction to data science using Python. It focuses on help and documentation in IPython and on exploring Python's data structures.

### What is Data Science?
Data science is the practice of using data to understand and solve real-world problems. It is the process of finding hidden patterns in raw or unstructured data. Data science requires programming and database skills; knowledge of math, statistics, and machine learning; and problem-solving ability grounded in business understanding or substantive expertise in the domain being analyzed.

In general, the following is a roadmap to becoming a data scientist (Source: [DataCamp](https://www.datacamp.com/community/tutorials/how-to-become-a-data-scientist)):
1. Get good at stats, math, and machine learning:
   * Math, especially linear algebra and calculus
   * Probability and statistics, such as p-values, t-tests, F-tests, estimation, and measures of central tendency
   * Machine learning algorithms and how they are organized
2. Learn to code: Python, R, ...
3. Understand databases: SQL, Oracle, MySQL, ...
4. Master data munging, visualization, and reporting
5. Level up with big data: Apache Spark, MapReduce, Hadoop, ...
6. Get experience and practice
7. Take an internship, join a bootcamp, or get a job
8. Follow and engage with the community


## Tools 
[Anaconda](https://www.anaconda.com/) is a Python distribution for data science and machine learning. It includes both Python and conda, and bundles a suite of pre-installed and ready-to-install packages geared toward scientific computing. [Miniconda](https://docs.conda.io/en/latest/miniconda.html) is a smaller, bootstrap version of Anaconda that includes only conda, Python, and a small number of supporting packages.

The following packages (modules) are among the most useful for data science and scientific computing: 
* [NumPy](https://numpy.org/): the fundamental package for scientific computing in Python; it provides efficient storage and computation for multi-dimensional data arrays.
* [Pandas](https://pandas.pydata.org/): a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
* [SciPy](https://www.scipy.org/about.html): a collection of open source software for scientific computing in Python.
* [Matplotlib](https://matplotlib.org/): a comprehensive library for creating static, animated, and interactive visualizations in Python.
* [Scikit-Learn](https://scikit-learn.org/stable/): simple and efficient tools for predictive data analysis, with a uniform interface for applying common machine learning algorithms to data.
* [IPython/Jupyter](https://ipython.org/): IPython is a powerful interactive shell and a kernel for Jupyter notebooks. Together they provide an enhanced terminal and an interactive notebook environment that is useful for exploratory analysis.


In [1]:
# Check the installed Python version, option 1: import the function directly
from platform import python_version
print("The Python version is: ", python_version())

# Check the installed Python version, option 2: import the module and call the function through it
import platform
print("The Python version is: ", platform.python_version())


The Python version is:  3.8.11
The Python version is:  3.8.11
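
The same kind of check can be applied to the packages listed under Tools. The following is a minimal sketch, not part of the original notebook: the helper name `installed_versions` and the `PACKAGES` list are hypothetical, and the distribution names are assumed to match the names used on PyPI. It relies on `importlib.metadata`, which is in the standard library from Python 3.8 onward.

```python
from importlib import metadata  # standard library since Python 3.8

# Distribution names as assumed to be published on PyPI; adjust to your setup
PACKAGES = ["numpy", "pandas", "scipy", "matplotlib", "scikit-learn", "ipython"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

A package reported as "not installed" can usually be added with conda or pip from the same environment the notebook kernel runs in.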


##  Help and Documentation in IPython

In [2]:
# Use the built-in help() function to access an object's documentation and print it
help(len)

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



In [3]:
# Access documentation with '?' to explore the docstring and other relevant information
# It works on anything: built-in functions, methods, objects, ...
len?

In [4]:
# List example
L = [7, 9, 11, 13]
L?

In [5]:
L.insert?

In [6]:
# Example function 
def square(a):
    """Return the square of a."""
    return a ** 2

In [7]:
# Accessing documentation
square?

In [8]:
# Access source code with '??': it shows everything '?' shows, plus the
# source code if the object is implemented in Python; otherwise it shows the same info as '?'
square??
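
The `?` and `??` shortcuts only work inside IPython and Jupyter. As a rough sketch of how to get similar information in plain Python, the standard-library `inspect` module can retrieve the docstring and the call signature. This is an approximation of what `?` displays, not a description of how IPython implements it.

```python
import inspect


def square(a):
    """Return the square of a."""
    return a ** 2


# Cleaned-up docstring, similar to the docstring text shown by `square?`
doc = inspect.getdoc(square)

# Call signature rendered as a string
sig = str(inspect.signature(square))

print(f"square{sig}")
print(doc)

# Built-in functions carry docstrings too
print(inspect.getdoc(len))
```

For functions defined in a source file, `inspect.getsource` plays a role similar to `??`. It is left out of the sketch because it may fail for functions typed at a plain interactive prompt.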

## Exploring Data Structures
Data structures are the fundamental constructs used to store, manage, organize, search, and manipulate data efficiently, depending on your needs.

Categories of Python data structures:

* Dictionaries, maps, and hash tables
* Array data structures: list, tuple, array.array, str, bytes, bytearray
* Sets and multisets
* Stacks
* Queues
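
As an illustration of the categories above, here is a small sketch that builds one structure of each kind using only built-ins and the standard-library `collections` module. The variable names and sample values are made up for the example. Using a list as a stack and `collections.deque` as a queue is one common choice, not the only one.

```python
from collections import Counter, deque

# Dictionary (hash table): key -> value lookup
ages = {"ada": 36, "alan": 41}
ages["grace"] = 45

# Array-like structures: a list is mutable, a tuple is immutable
primes = [2, 3, 5]
primes.append(7)
point = (1.0, 2.0)

# Set: unique elements only; Counter can act as a multiset
unique = set([1, 1, 2, 3, 3])
counts = Counter("banana")

# Stack (last in, first out) using a list
stack = []
stack.append("a")
stack.append("b")
top = stack.pop()

# Queue (first in, first out) using collections.deque
queue = deque()
queue.append("first")
queue.append("second")
front = queue.popleft()

print(ages, primes, point, unique, counts, top, front, sep="\n")
```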

### Data Types
* Numeric types: int, float, complex
* Iterator types: generator
* Sequence types: list, tuple, range
* Text sequence types: str
* Binary sequence types: bytes, bytearray, memoryview
* Set types: set, frozenset
* Mapping types: dict
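
To see these types in practice, the sketch below creates one value of each kind and prints its type name using the built-in `type()` function. The `countdown` generator and the sample values are hypothetical and exist only to produce an object of each type.

```python
from collections.abc import Iterator, Mapping, Sequence


def countdown(n):
    """Hypothetical generator used only to produce a generator object."""
    while n > 0:
        yield n
        n -= 1


examples = [
    42, 3.14, 2 + 3j,                                   # numeric types
    countdown(3),                                       # iterator type
    [1, 2], (1, 2), range(3),                           # sequence types
    "text",                                             # text sequence type
    b"raw", bytearray(b"raw"), memoryview(b"raw"),      # binary sequence types
    {1, 2},                                             # set type
    {"key": "value"},                                   # mapping type
]

for value in examples:
    print(f"{type(value).__name__:<12} {value!r}")

type_names = [type(value).__name__ for value in examples]

# Abstract base classes group concrete types into the categories listed above
is_iterator = isinstance(countdown(1), Iterator)
is_sequence = isinstance(range(3), Sequence)
is_mapping = isinstance({"key": "value"}, Mapping)
```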

In [9]:
# # https://towardsdatascience.com/jupyter-notebook-to-pdf-in-a-few-lines-3c48d68a7a63
# !pip install -U notebook-as-pdf

In [10]:
# !pyppeteer-install