jswong65/Machine_Learning_Nanodegree

Udacity Machine Learning Nanodegree

Project implementations for the Udacity Machine Learning Nanodegree. These projects cover different aspects of machine learning, including Supervised Learning, Unsupervised Learning, Reinforcement Learning, and Model Evaluation & Validation.

Several Python data analysis packages are used for the project implementations.

  • NumPy: Performs numerical operations.
  • Pandas: Data I/O, manipulation, and visualization.
  • Matplotlib, seaborn: Data visualization.
  • scikit-learn: Builds, trains, and tests machine learning models.

Many of the datasets used in these projects can be found in the UC Irvine Machine Learning Repository.

Project Description
titanic_survival_exploratio An introductory project to Machine Learning. It explores variables that can be used to predict whether a Titanic passenger survived, including socio-economic class, gender, age, and fare. The results suggest that gender, age, and socio-economic class are important variables for prediction.
boston_housing Model Evaluation & Validation. The goal of this project is to predict Boston housing prices.
  • Apply DecisionTreeRegressor to predict the housing prices.
  • Evaluate the model with the R-squared score, learning curves, and a model complexity curve to examine the bias-variance trade-off.
  • Use grid search with K-fold cross-validation to find the parameters that optimize the prediction model.
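The tuning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes synthetic data from `make_regression` in place of the housing data, and the `max_depth` grid, the 5 folds, and the 80/20 split are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): the project's housing data is not used here.
X, y = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate tree depths (illustrative grid); deeper trees are more complex.
param_grid = {"max_depth": list(range(1, 11))}

# K-fold cross-validation on the training set only; the test set stays untouched.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0), param_grid, scoring="r2", cv=cv
)
search.fit(X_train, y_train)

# GridSearchCV refits the best model on the full training set by default.
best_depth = search.best_params_["max_depth"]
test_r2 = r2_score(y_test, search.predict(X_test))
print("best max_depth:", best_depth)
print("R-squared on held-out data:", round(test_r2, 3))
```

Scoring each depth by cross-validated R-squared rather than training R-squared is what guards against picking an overly deep tree that merely memorizes the training data.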
finding_donors Supervised Learning. The goal of this project is to find donors for a charity.
  • Data Preprocessing
    • Log transformation for skewed continuous variables
    • Data normalization for numerical variables (MinMaxScaler)
    • One-hot encoding for categorical variables (pandas.get_dummies)
  • Train, evaluate, and compare three classifiers, KNeighborsClassifier, RandomForestClassifier (bagging), and GradientBoostingClassifier (boosting), using both accuracy and the F-beta score.
  • Use grid search and cross-validation to find the parameters for model optimization.
  • Use principal component analysis (PCA) to reduce the dimensionality of the data.
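As a rough sketch of the preprocessing and scoring steps listed above: the toy records, the column names (`capital_gain`, `age`, `workclass`), the use of `log(x + 1)` to handle zeros, and `beta=0.5` are all assumptions made for illustration, not details taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.preprocessing import MinMaxScaler

# Toy records with hypothetical column names (not the project's data).
data = pd.DataFrame({
    "capital_gain": [0.0, 0.0, 5000.0, 99999.0],
    "age": [25.0, 38.0, 52.0, 61.0],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
})
skewed = ["capital_gain"]
numerical = ["capital_gain", "age"]

features = data.copy()
# Log-transform skewed columns; log(x + 1) keeps zero values defined.
features[skewed] = np.log1p(features[skewed])
# Rescale every numerical column to the [0, 1] range.
features[numerical] = MinMaxScaler().fit_transform(features[numerical])
# One-hot encode the remaining categorical columns.
features = pd.get_dummies(features)
print(list(features.columns))

# Scoring toy predictions: beta < 1 weights precision more than recall.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
accuracy = accuracy_score(y_true, y_pred)
f_half = fbeta_score(y_true, y_pred, beta=0.5)
print("accuracy:", accuracy, "F0.5:", f_half)
```

In a real workflow the scaler would be fitted on the training split only and then applied to the test split, so that no information leaks from the held-out data.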
customer_segments Unsupervised Learning. The goal of this project is to create customer segments.
  • Feature Exploration
    • Use box plots and histograms to examine the distribution of individual variables
    • Use a scatter plot matrix and a heatmap to study correlations between variables
    • Apply multiple coordinates to investigate relationships among several variables at once
  • Data Preprocessing
    • Perform feature scaling (using the natural logarithm) to reduce the skewness of highly skewed data
    • Apply Tukey's method to identify outliers to be removed
  • Compare the K-means clustering and Gaussian mixture model (GMM) for data clustering.
  • Apply GMM to cluster the data, and use the silhouette coefficient as well as the Bayesian information criterion (BIC) to choose the number of clusters.
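The outlier and cluster-selection steps above might look roughly like the sketch below. It assumes synthetic data from `make_blobs` rather than the customer data; the helper names `tukey_outlier_mask` and `score_gmm`, the 1.5 x IQR step, and the candidate range of 2 to 5 clusters are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def tukey_outlier_mask(values, k=1.5):
    """Flag points lying more than k * IQR outside the first or third quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    step = k * (q3 - q1)
    return (values < q1 - step) | (values > q3 + step)


def score_gmm(X, candidate_counts, seed=0):
    """Fit one GMM per candidate cluster count and record BIC and silhouette."""
    results = {}
    for n in candidate_counts:
        gmm = GaussianMixture(n_components=n, n_init=3, random_state=seed).fit(X)
        labels = gmm.predict(X)
        results[n] = {
            "bic": gmm.bic(X),  # lower is better
            "silhouette": silhouette_score(X, labels),  # higher is better
        }
    return results


# Tukey's method on a toy feature: only the extreme value is flagged.
print(tukey_outlier_mask([1, 2, 3, 4, 5, 100]))

# Synthetic stand-in data (assumption): three blobs in two dimensions.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
scores = score_gmm(X, range(2, 6))
best_by_bic = min(scores, key=lambda n: scores[n]["bic"])
best_by_silhouette = max(scores, key=lambda n: scores[n]["silhouette"])
print("best by BIC:", best_by_bic, "best by silhouette:", best_by_silhouette)
```

The two criteria do not always agree: BIC rewards likelihood while penalizing the number of model parameters, whereas the silhouette coefficient only measures how well separated the resulting hard assignments are.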
smartcab Reinforcement Learning. The goal of this project is to train a Smartcab to drive.
  • Apply Q-Learning to teach a cab to drive safely and efficiently in a simulation.
  • Identify appropriate features for modeling the Smartcab in its environment (building a state).
  • Attach rewards and penalties to different outcomes to teach the cab to reach the destination as soon as possible without causing an accident.
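For orientation, a tabular Q-learning update with epsilon-greedy action selection can be sketched as below. This is a generic illustration and not the project's agent: the action set, the state tuples, and the values of `alpha`, `gamma`, and the rewards are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical action set for illustration; None means "stay idle".
ACTIONS = (None, "forward", "left", "right")


def make_q_table():
    """Map each state to a dict of action -> Q-value, initialised to zero."""
    return defaultdict(lambda: {action: 0.0 for action in ACTIONS})


def choose_action(q_table, state, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    q_values = q_table[state]
    best = max(q_values.values())
    # Break ties between equally good actions at random.
    return rng.choice([a for a, q in q_values.items() if q == best])


def update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Move Q(state, action) toward reward + gamma * max Q(next_state, .)."""
    best_next = max(q_table[next_state].values())
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Hypothetical states built from (waypoint, light colour).
q_table = make_q_table()
state, next_state = ("forward", "green"), ("left", "red")
print(update(q_table, state, "forward", reward=2.0, next_state=next_state))
print(update(q_table, state, "forward", reward=2.0, next_state=next_state))
print(choose_action(q_table, state, epsilon=0.0))
```

Keeping the state tuple small matters in practice: every extra feature multiplies the number of state-action pairs the agent must visit before its Q-values become reliable.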